Generating Character Animation for the Apex Game Engine using Neural Networks
Implementing immersive character animation in an industry-proven game engine by applying machine learning techniques

Master’s thesis in Computer science and engineering for a degree at MPIDE - Interaction Design and Technologies, MSc.

JOHN SEGERSTEDT

Department of Computer Science and Engineering
CHALMERS UNIVERSITY OF TECHNOLOGY
UNIVERSITY OF GOTHENBURG
Gothenburg, Sweden 2020

Master’s thesis 2021

© JOHN SEGERSTEDT, 2021.

mentor: Andreas Tillema, Avalanche Studios Group
initial supervisor: Marco Fratarcangeli, Department of Computer Science and Engineering
substitute supervisor: Palle Dahlstedt, Department of Computer Science and Engineering
examiner: Staffan Björk, Department of Computer Science and Engineering.
Master’s Thesis 2021
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000
Typeset in LaTeX
Gothenburg, Sweden 2021

Abstract

The art of machine learning, here using neural networks to map pairs of inputs to outputs, has been greatly expanded upon recently. It has been shown to produce generalizable solutions within multiple fields of research and has been deployed in real-world commercial products. One of the research areas in which scientific achievements are regularly made is game development, and specifically character animation. However, even though there has been much work on applying machine learning techniques to character animation, few efforts have been made, compared to other fields, to apply them in real-world game engines. This thesis project aimed to research the applicability of one such piece of previous work within the proprietary Apex game engine. The final results included an in-engine solution, producing character animation purely from a predictive phase-functioned neural network. Additionally, several different network configurations were evaluated to compare the impact of using, for example, a deeper network or a network that had been trained for a longer period of time, in an attempt to investigate potential improvements to the original model. These alterations were shown to have negligible positive impact on the final results. Also, an additional network configuration was used to investigate the applicability of this approach to an industry-used skeleton, producing promising but imperfect results.
Keywords: machine learning, phase-functioned neural network, locomotive character animation, Avalanche Studios Group, Apex, thesis

Acknowledgements

First and foremost, the author wants to show their appreciation for Andreas "Andy" Tillema for being an incredibly supportive mentor who has, especially during the later stages of this project, been a pillar of support. Without Andy's guidance, this project would not have been possible.

Secondly, the two supervisors Marco Fratarcangeli and Palle Dahlstedt deserve thanks for their support and feedback; Marco for the initial project scope and Palle for aiding in the process of turning a programming implementation into an actual thesis.

Thirdly, multiple Avalanche employees have offered their aid to this project. Some of those who deserve praise include Robert "Robban" Petterson, for helping with the .bvh retargeting, and Preeth Punnatjanath, for helping with .bvh to 3D model skinning. The latter was able to provide aid on short notice, even while tackling other deadlines.

John Segerstedt, Gothenburg, June 2021

Contents

1 Introduction
  1.1 Background
  1.2 Research Problem
  1.3 Research Question
  1.4 Scope
2 Theory
  2.1 Artificial Neural Networks
    2.1.1 Network Layers
    2.1.2 The McCulloch-Pitts Neuron
    2.1.3 Supervised Learning of a Neural Network
    2.1.4 Underfitting and Overfitting
    2.1.5 Gradient Descent
    2.1.6 Adam Optimizer
  2.2 Related Work
    2.2.1 Phase-Functioned Neural Networks for Character Control (2017)
    2.2.2 Mode-Adaptive Neural Networks for Quadruped Motion Control (2018)
    2.2.3 Neural State Machine for Character-Scene Interactions (2019)
    2.2.4 Local Motion Phases for Learning Multi-Contact Character Movements (2020)
    2.2.5 Learned Motion Matching (2020)
  2.3 File Types & Software Libraries
    2.3.1 The .bvh filetype
    2.3.2 Theano
    2.3.3 Eigen
3 Methodology
  3.1 To Answer the Research Questions
    3.1.1 Researching Responsivity
    3.1.2 Researching Accuracy
    3.1.3 Researching Architecture
    3.1.4 Simple Difference Significance Evaluation
  3.2 The Phase-Functioned Neural Network
    3.2.1 Network Structure
    3.2.2 The Input Vector
    3.2.3 The Output Vector
    3.2.4 The Phase Function
  3.3 The Full PFNN Pipeline
    3.3.1 Generate Patches
    3.3.2 Generate Database
    3.3.3 Network Training
    3.3.4 Neural Network
4 Process
  4.1 The Runtime Package
    4.1.1 Using the Runtime Package
    4.1.2 ProceduralAnimations
    4.1.3 PFNN
    4.1.4 Character
    4.1.5 Trajectory
    4.1.6 Settings
    4.1.7 HelperFunctions
    4.1.8 ErrorCalculator
    4.1.9 Waypoint
5 Result
  5.1 Responsivity Results
  5.2 Responsivity Visualizations
  5.3 Accuracy Results
  5.4 Accuracy Visualizations
  5.5 Training Process
  5.6 Pipeline Overview
  5.7 Skinning Visualization
6 Discussion
  6.1 Discussing the Research Questions
    6.1.1 Discussing Responsivity
    6.1.2 Discussing Accuracy
    6.1.3 Discussing Architecture
  6.2 Ethical Considerations
  6.3 Takeaways
    6.3.1 Integration Contextualization
    6.3.2 Integration Placement
    6.3.3 Implementation Expertise
    6.3.4 Equipment Suitability
    6.3.5 Neural Network Rigidity
  6.4 Future Work
    6.4.1 Runtime skeleton retargeting
    6.4.2 Consistent world-axis orientations
    6.4.3 Full pipeline integration
7 Conclusions
  7.1 Summary
  7.2 Final Words
Bibliography
List of Figures
List of Tables
A Appendix
  A.1 .bvh Interpolator
  A.2 Filenames
  A.3 AVA skeleton joint names
  A.4 Responsivity Data
  A.5 Accuracy Data - Means
  A.6 Accuracy Data - Standard Deviations
  A.7 Training Mean-Squared Error
1 Introduction

1.1 Background

The domain of machine learning techniques applied within video game development has expanded greatly during the last few years, with multiple AAA-level publishers funding dedicated machine learning departments. Amongst others, these include Ubisoft La Forge, funded by Ubisoft Entertainment [1], and SEED, funded by Electronic Arts Inc. [2]. Additionally, there have been great achievements within academia on this topic, such as the research done at the University of Edinburgh by Sebastian Starke, Daniel Holden, and others [3].

Within the specific sub-domain of generated character animation, considerable research achievements have been made by some of the aforementioned entities, much of it published as recently as 2020, see Section 2.2.

One of the reasons behind the recent adoption of machine learning within character animation is the need for automation when managing potentially hundreds of thousands of animation clips; as the demand for variation and fidelity, and also for more adaptive and life-like animations, has increased, there has been an exponential increase in the number of animations demanded by newer game titles. This great escalation of the problem space is exemplified by the expectation that animations adapt to external factors, such as uneven terrain. Without such context-specific animations, the players’ sense of immersion may be broken.

By using scalable and context-free machine learning techniques to generate more environmentally feasible animations, players may be kept more immersed in the gameplay experience, without developers having to manually link a seemingly endless number of animation clips and states.
As a result of the video game industry’s immense size, there is a myriad of stakeholders in potentially revolutionary and commercially viable innovations using the newly emerging techniques within machine learning; amongst others, within the research, development, publishing, and consumption of video games. Within the context of this thesis, the video game development studio Avalanche Studios Group [4] is a direct stakeholder in the outcome of this project as a result of their direct collaboration with the project.

1.2 Research Problem

The aim of this master’s thesis is to investigate the possibility of generating realistic character animations using a predictive neural network trained on previously captured animation data. This is to be achieved utilizing phase-functioned neural networks (PFNN), based on previous research by Holden et al. [5].

Additionally, this master’s thesis aims to contribute to the research field by developing an implementation within the Apex Game Engine [6], in contrast to previous research. This proprietary game engine will be provided by the Avalanche Studios Group [4] for use within this thesis.

1.3 Research Question

Main research question:
• How can the applicability of a Phase-Functioned Neural Network approach for generating real-time locomotive character animation in modern game engines be further improved?

To answer the wicked problem that is the main research question, this thesis aims to investigate the following subsidiary questions:
• Responsivity - How much computational time is required for procedural single-character locomotive animations, in an industry-proven game engine, on consumer-grade hardware?
• Accuracy - How accurate are the generated locomotive character animations, in an industry-proven game engine, to the original animation data?
• Architecture - How can the phase-functioned neural network architecture presented in Holden et al. be improved?
Additionally, the quantitative results of the responsivity and accuracy research will be analysed and compared between the following networks, see Section 3.1.3:
• Holden - Default
• Holden - Extra Trained
• Holden - Extra Layer
• Avalanche

Also, a summary of some of the most important learnings produced by this thesis project will be written, see Section 6.3.

1.4 Scope

network type
Because of its simplicity, which is further discussed in Section 2.2, a phase-functioned neural network was selected as the network architecture for this project.

network architecture
To answer the subsidiary research question regarding network architecture, and as a clear delimitation of the scope of this thesis project, comparisons will be made solely between the network archetypes listed in Section 1.3.

previous research
Since the phase-functioned neural network architecture was developed by Holden et al. at the University of Edinburgh [5], the published network pipeline and accompanying motion data will be used as a basis for this project. This previous research is based on a left-handed world-axis orientation, see Section 3.3.4. The motion capture data is of the .bvh filetype, see Section 2.3.1.

gait styles
To limit the space of character animations, only standard bipedal locomotive character animations, such as walking, jogging, crouching, and strafing, are to be considered. Similarly, interactions with advanced terrain and environments, such as balancing on elevated narrow beams and dynamic crouching beneath low ceilings, will not be considered. As such, these movement styles will be omitted from the responsivity evaluation, see Section 1.3. However, motion capture files associated with these movements will still be part of the data set for the evaluation process, see Section 1.3.
game engine
This thesis will evaluate the feasibility of procedural animations only within the Apex game engine, provided by Avalanche Studios Group. This engine uses a right-handed world-axis orientation, see Section 3.3.4.

confidentiality
As a result of the Apex game engine being proprietary software solely used in-house, a certain level of discretion is required by Avalanche Studios Group. This includes, but is not limited to, potential omission of engine-specific details from the final report.

phase function computation
Out of the three methods of computing the phase variable presented in Holden et al. [5], only ‘constant’ is to be considered during this thesis project. This is done in an effort to reduce the number of permutations of network settings.

terrain and inclination
The original demonstration application produced by Holden et al. [5] uses a static heightmap for terrain height sampling. As this is not the case for the Apex game engine, the phase-functioned neural network integration implemented as part of this project will assume a fully flat terrain during runtime.

hardware
The entire evaluation process, and the network training, will be performed on a single set of hardware specifications:
• CPU - Intel i7-8700k @ 3.70GHz
• GPU - NVIDIA GTX 1060 6GB
• RAM - 16.0GB

2 Theory

2.1 Artificial Neural Networks

This section provides an introduction to artificial neural networks, the machine learning model used for mapping input features to output targets by updating network weights and biases.

The concept of an artificial neural network is based on the human brain; given a sensory input, and as internal energy levels surpass specific thresholds, synapses are fired between neurons connected in a graph network.
A neural network can be trained to model arbitrary input-output relationships using either:
• Supervised learning - Comparing network outputs to a ground truth; for example, used for image and speech recognition.
• Unsupervised learning - Attempts to minimize a given error measure as no specific ground truths are given; for example, used for clustering and classification.
• Reinforcement learning - Traverses the solution space by being provided intermediary encouragement and punishment given specific state spaces; for example, used for self-driving vehicles.

For this thesis project, and for the rest of this theory segment, supervised learning is the considered context.

2.1.1 Network Layers

A neural network can be modelled as a feed-forward directed acyclic graph with multiple connected layers, see Figure 2.1. When a neural network is given sensory input, this data is fed into the input layer. Then, by evaluating the outputs of each layer given those of its predecessor, see Section 2.1.2, the resulting evaluation of the network is produced in the output layer. A network can have any number of hidden layers between the input and the output layers, and any number of nodes in each layer.

Figure 2.1: Simplified fully connected neural network model

Within the research field of machine learning, there are different network architectures consisting of other types of network layers than the fully connected layer type shown in Figure 2.1. Some other network types and examples of their usage are:
• Recurrent Neural Networks - By allowing for cyclical node connections, information can be passed and remembered between iterations. Used in text recognition and translation, amongst other fields.
• Convolutional Neural Networks - By using feature-convolving kernels that sequentially read subsets of the input, pattern recognition can be performed independent of the location of that pattern within the input data.
Used in image recognition, amongst other fields.

2.1.2 The McCulloch-Pitts Neuron

Named after its inventors, the McCulloch-Pitts neuron [7] is the smallest component of a neural network. Such a neuron, see Figure 2.2, produces an output given an internal threshold and the weighted inputs of other neurons.

Figure 2.2: Simplified McCulloch-Pitts neuron model

Consider Equation 2.1; to evaluate the output signal of a McCulloch-Pitts neuron, one first computes the local field $b_i^{(L+1)}$ and passes it into an activation function $g$. Popular activation functions include the ReLU, $g(b_i) = \max(0, b_i)$, and the sigmoid, $g(b_i) = \frac{1}{1 + e^{-b_i}}$ [8].

$$S_i^{(L+1)} = g\left(b_i^{(L+1)}\right) = g\left(\sum_j w_{ij}^{(L+1)} S_j^{(L)} - \theta_i^{(L+1)}\right) \tag{2.1}$$

where:
• $g$ is the activation function of the neuron
• $w_{ij}^{(L+1)}$ is the weight scalar from neuron $j$ in layer $L$ to neuron $i$ in layer $L+1$
• $S_j^{(L)}$ is the output of neuron $j$ in the previous layer $L$
• $\theta_i^{(L+1)}$ is the bias/threshold of neuron $i$

2.1.3 Supervised Learning of a Neural Network

By adjusting the weights and biases, a network can map any given set of input features X to any specific output Y. These are often referred to as pairs of input and output vectors. To achieve this, a training process such as the following is performed:

Simplified training algorithm of a neural network
1. Split the data into two sets: training and validation
2. Initialize the network with random weights and biases
3. Use backpropagation to train the network using the training data according to the algorithm below (= an ‘epoch’)
4. Evaluate the accuracy of the network by performing a complete prediction of all data points in the validation set, see Section 2.1.5
5. If the validation accuracy is increasing, according to some heuristic, go to step 3. Otherwise, terminate training (= ‘early stopping’), see Section 2.1.4

Backpropagation is the technique of using the chain rule [9] to compute the weight updates backwards through the network, starting from the computed error in the output layer.
This is done by performing the following algorithm:

Backpropagation algorithm for a neural network
1. Forward propagate the input through the network
2. Calculate the output error using the difference to the ground truth
3. Propagate the errors back through the network
4. Update the weights using the backpropagated errors
5. Update the biases using the backpropagated errors

which is equivalent to:
1. $S_i^{(L)} \leftarrow g\left(\sum_j w_{ij}^{(L)} S_j^{(L-1)} - \theta_i^{(L)}\right)$, for all neurons $i$ in layers $L$
2. $\delta_k^{(O)} \leftarrow g'\left(b_k^{(O)}\right)\left(y_k - S_k^{(O)}\right)$, for all neurons $k$ in output layer $O$
3. $\delta_i^{(L)} \leftarrow g'\left(b_i^{(L)}\right) \sum_j \delta_j^{(L+1)} w_{ji}^{(L+1)}$, for all neurons $i$ in non-output layers $L$
4. $w_{ij}^{(L)} \leftarrow w_{ij}^{(L)} + \eta\, \delta_i^{(L)} S_j^{(L-1)}$, for all neurons $i$ in layers $L$
5. $\theta_i^{(L)} \leftarrow \theta_i^{(L)} - \eta\, \delta_i^{(L)}$, for all neurons $i$ in layers $L$

where $\eta \in (0, 1)$ is the learning rate of the network, which may decay during the training process. The learning rate, and other parameters such as the number of epochs or the batch size, are collectively referred to as hyperparameters.

2.1.4 Underfitting and Overfitting

When training a neural network, or when performing other types of regression model fitting, the choice of model complexity may give rise to issues if the complexity level is too low or too high.

Consider Figure 2.3, which shows the same data points and three different regression models. The optimal model for these types of problems is the one that most accurately represents the data distribution, and that can therefore most precisely predict future data points belonging to the same data set. In the figure, the left model suffers from underfitting, whereas the right model suffers from overfitting.

Figure 2.3: Simplified example of under/overfitting in 2D regression. Left: Underfitting, model fails to accurately represent data. Middle: Optimal fit, model mimics the sampled distribution.
Right: Overfitting, model fails to generalize observations.

In the previous example shown in Figure 2.3, the polynomial degree of a regression model was shown to directly relate to any potential underfitting or overfitting. However, when it comes to neural network training, the complexity of the model is already decided upon prior to the training session. As such, the analogous parameter for neural network training is instead the number of training iterations.

Figure 2.4: Simplified example of early stopping. As the prediction error on the training data reduces during the training process, the prediction error on the validation data initially decreases. After some time, the validation error starts to increase as generalizability is lost.

Consider Figure 2.4; here a neural network is trained repeatedly using a training data set of input and output vector pairs, see Section 2.1.3. A natural expectation of such a training process is that the prediction error on the training data set steadily declines as the training, often counted in the number of epochs, proceeds.

However, to prevent the potential loss of generalization of the network, a separate validation data set is used. The network is at no point allowed to learn from, or update its network weights and biases in response to, being shown the validation data set. Instead, this data set is solely used to estimate the prediction qualities of the network given unseen data. In other words, the calculated error of the network predictions on the validation data set is considered a measure of the generalizability of the network: the lower the validation error, the better the network generalizes.

As such, the ideal performance of the network is reached when the validation error is at its minimum, at which point the training process should terminate. This is visualized in Figure 2.4, where halting training before the validation error starts to increase leads to potential underfitting, and halting after this minimum leads to overfitting.
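The training procedure of Section 2.1.3, combined with the early stopping described above, can be sketched as follows. This is a minimal illustration in plain Python and not the training code used in this project: the network size, the toy data set, the learning rate, and the `patience` stopping heuristic are all assumptions chosen for brevity. The update rules follow the sign conventions of Equation 2.1, where the local field is the weighted input minus the threshold.

```python
import math
import random


def g(b):
    """Sigmoid activation function."""
    return 1.0 / (1.0 + math.exp(-b))


def g_prime(b):
    """Derivative of the sigmoid with respect to the local field b."""
    s = g(b)
    return s * (1.0 - s)


class TinyNetwork:
    """One input, one hidden layer, one output; local field b = sum(w * S) - theta."""

    def __init__(self, n_hidden, rng):
        self.w_hidden = [rng.uniform(-1.0, 1.0) for _ in range(n_hidden)]
        self.theta_hidden = [0.0] * n_hidden
        self.w_out = [rng.uniform(-1.0, 1.0) for _ in range(n_hidden)]
        self.theta_out = 0.0

    def forward(self, x):
        b_hidden = [w * x - t for w, t in zip(self.w_hidden, self.theta_hidden)]
        s_hidden = [g(b) for b in b_hidden]
        b_out = sum(w * s for w, s in zip(self.w_out, s_hidden)) - self.theta_out
        return b_hidden, s_hidden, b_out, g(b_out)

    def backpropagate(self, x, y, eta):
        b_hidden, s_hidden, b_out, s_out = self.forward(x)
        # Output error, then hidden errors (computed before any weight changes).
        delta_out = g_prime(b_out) * (y - s_out)
        delta_hidden = [g_prime(b) * delta_out * w for b, w in zip(b_hidden, self.w_out)]
        for j in range(len(self.w_out)):
            self.w_out[j] += eta * delta_out * s_hidden[j]
            self.w_hidden[j] += eta * delta_hidden[j] * x
            self.theta_hidden[j] -= eta * delta_hidden[j]
        self.theta_out -= eta * delta_out


def mean_squared_error(net, data):
    return sum((y - net.forward(x)[3]) ** 2 for x, y in data) / len(data)


def train(net, training_set, validation_set, rng, eta=0.5, max_epochs=300, patience=20):
    """Train epoch by epoch; stop once the validation error has not improved
    for `patience` epochs (an assumed heuristic for early stopping)."""
    best_error, best_epoch, history = float("inf"), 0, []
    for epoch in range(max_epochs):
        rng.shuffle(training_set)
        for x, y in training_set:
            net.backpropagate(x, y, eta)
        error = mean_squared_error(net, validation_set)
        history.append(error)
        if error < best_error:
            best_error, best_epoch = error, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_error, history


# Toy data: a noisy sine curve, scaled to the output range of the sigmoid.
rng = random.Random(0)
samples = []
for _ in range(60):
    x = rng.uniform(-3.0, 3.0)
    samples.append((x, 0.5 + 0.3 * math.sin(x) + rng.gauss(0.0, 0.03)))
training_set, validation_set = samples[:45], samples[45:]

net = TinyNetwork(n_hidden=4, rng=rng)
initial_error = mean_squared_error(net, validation_set)
best_error, history = train(net, training_set, validation_set, rng)
print(f"validation MSE: {initial_error:.4f} -> {best_error:.4f} after {len(history)} epochs")
```

A fuller implementation would also store a copy of the weights at the best epoch and restore them once training stops; that step is left out here to keep the sketch short.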
2.1.5 Gradient Descent

The backpropagation algorithm presented in Section 2.1.3 attempts to minimize the prediction error by updating each weight and bias according to the derivative of the error function with respect to that variable.

Conceptually, one can visualize this process using Figure 2.5, which shows how the prediction error of a neural network is directly dependent on the values of the weights and biases. Then, as a training process includes initializing random weights and biases, consider a marked spot at a random starting point on the curve. From there on, through gradient descent during the network training process, that marked spot will move downhill as the network weights and biases are updated.

Figure 2.5: Simplified example of prediction error depending on weights/biases. This model is heavily simplified. In actuality, a more realistic representation has one dimension for each weight value and for each bias value.

However, by using true gradient descent on an unknown error landscape, one might get stuck in a local minimum. This can be visualized in Figure 2.5 if one were to initialize a network configuration close to the noted local minimum.

To avoid this serious issue, one introduces even more randomness into the network training procedure. If the network solution is allowed to occasionally move uphill on the error landscape, against the direction of steepest descent, a configuration of weights and biases may escape a local minimum.

This further randomness can be introduced by updating the weights and biases after only having seen a randomly chosen subset, called a batch, of the training data set. This is called batch training, and the technique of adding further randomness to the model training is referred to as stochastic gradient descent.
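The batch mechanics described above can be sketched on a deliberately simple problem. The example below is an illustrative assumption rather than part of the PFNN pipeline: it fits a straight line `y = a * x + c` to noise-free data, shuffling the data each epoch and updating the two parameters after every batch. Because this error landscape has a single minimum, the sketch shows only how batch updates are performed, not the escape from a local minimum.

```python
import random


def batch_gradient(a, c, batch):
    """Gradient of the mean squared error of y ~ a * x + c over one batch."""
    grad_a = sum(-2.0 * (y - (a * x + c)) * x for x, y in batch) / len(batch)
    grad_c = sum(-2.0 * (y - (a * x + c)) for x, y in batch) / len(batch)
    return grad_a, grad_c


def stochastic_gradient_descent(data, batch_size=8, eta=0.1, epochs=200, seed=0):
    """Update (a, c) after each randomly composed batch instead of after
    a pass over the full data set."""
    rng = random.Random(seed)
    a, c = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)  # new random batches every epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            grad_a, grad_c = batch_gradient(a, c, batch)
            a -= eta * grad_a  # step against the batch gradient
            c -= eta * grad_c
    return a, c


# Noise-free samples of the line y = 2x + 1 on the interval [-1, 1).
xs = [i / 32.0 - 1.0 for i in range(64)]
data = [(x, 2.0 * x + 1.0) for x in xs]
a, c = stochastic_gradient_descent(data)
print(f"a = {a:.3f}, c = {c:.3f}")
```

Each batch gradient is only an estimate of the gradient over the full data set, which is the source of the randomness referred to above.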
2.1.6 Adam Optimizer

The Adam optimization algorithm [10] [11] is an extension to stochastic gradient descent, see Section 2.1.5, first presented in 2015, aiming to increase efficiency during neural network training.

The most impactful difference between the Adam optimization algorithm and standard stochastic gradient descent is that, where the latter uses a single learning rate for all trainable parameters, the former uses an individual, adaptive learning rate for each parameter. These rates are derived from running averages of the gradient and of its square, whose decay rates are controlled through the hyperparameters beta1 and beta2.

2.2 Related Work

Recently, advancements in machine learning using deep neural networks have been made in a number of fields:

Since ImageNet [12] 2015, the annual image recognition algorithm competition, competitors have been able to produce machine learning solutions that outperform humans in classifying photographs of objects [13].

In 2020, the AlphaFold agent developed by the DeepMind team, funded by Google, was able to make accurate predictions of protein shapes based on their sequences of amino acids, potentially being able to "accelerate research in every field of biology" [14].

The tech giant Google is using machine learning techniques for a multitude of their online services, such as: busyness metrics for public areas, individual passage indexing on webpages, song/music identification, breaking news detection, and language translation [15] [16].

Within gaming specifically, there have been multiple recent breakthroughs:

In 2016, the machine learning agent AlphaGo, produced by the DeepMind team, defeated Lee Sedol, winner of 18 world titles, in the board game of Go [17]. The achievement led to AlphaGo officially being the first ever computer agent to be awarded the highest ranking certification within the sport, 9 dan [17]; a large step forward from the narrow matches between IBM’s Deep Blue and Garry Kasparov in the much less complex game of Chess in 1997.
In a similar vein, after AlphaGo’s triumph in Go, computer games have become a new focus for deep learning agents. One of these is AlphaStar, also developed by DeepMind, which in 2019 became the first AI to achieve the highest ranking within a widely popular esport title without any game restrictions, in the computer game StarCraft II [18].

Another such agent is OpenAI Five, developed by the OpenAI team and funded by, amongst others, Elon Musk, which also in 2019 achieved both expert-level performance and human-AI cooperation in the computer game Dota 2 [19].

However, world class competition is not the only use for neural network agents. Multiple games have officially launched with either fully, or partially, machine learned agents, such as the A.I. that players can play against in the game Planetary Annihilation [20]. Another example, where machine learning is only partially used, is the threat response, the fight-or-flight reaction, of the A.I. agent in the game Supreme Commander 2 [21].

Additionally, similar agents have been developed to the benefit of game designers and developers, for reasons such as gameplay balancing and strategy win predictions [22].

Other uses of machine learning techniques to increase the productivity, generalizability, or efficiency of the game development process include the field of procedurally generated game content. Areas in which machine learning techniques have been applied in such research include: level design, text and narration, music and sound effects, model textures, and character animations [23].
Some other specific examples of machine learning applied during game development come from the machine learning research division Ubisoft La Forge [1], which has conducted research on, amongst others: 3D character navigation [24], motion in-betweening for character animations [25], data-driven physics simulations [26], automatic code bug detection [27], and motion capture data denoising [28].

As this report is on the topic of neural network generated locomotive character animations, the rest of this section is dedicated to presenting different advancements within this field and, where relevant, to relating their respective strengths and weaknesses to those of phase-functioned neural networks.

The following research is presented chronologically.

2.2.1 Phase-Functioned Neural Networks for Character Control (2017)

In this paper, Holden et al., at the University of Edinburgh, introduce a single network architecture for generating locomotive character animation over rough terrain using a phase function [5].

At each frame during runtime, this neural network takes as input the current pose of the character, any potential user input, and information about specific sample points along the ground ahead of and behind the character, to produce the next character pose. This character pose consists of information regarding each joint within the character model, such as its position and orientation.

To accurately model the cyclical nature of the human walk, a function producing a phase variable is introduced. This variable models the transitions between the contacts of each of the character's two feet with the ground. This ensures a perpetual forward animation of the character, stepping with each foot sequentially.

For a more in-depth presentation of the phase-functioned neural network used in this report, see Section 3.2.
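To make the role of the phase variable more concrete, the following is a minimal sketch of how network weights can be made a function of the phase. It assumes, as in the original paper [5], that the phase function is a cyclic cubic Catmull-Rom spline through four sets of control weights; the tiny three-element weight vectors and the function names are hypothetical and serve only as illustration.

```python
import math


def catmull_rom(y0, y1, y2, y3, mu):
    """Cubic Catmull-Rom interpolation between y1 (mu = 0) and y2 (mu = 1)."""
    return ((-0.5 * y0 + 1.5 * y1 - 1.5 * y2 + 0.5 * y3) * mu ** 3
            + (y0 - 2.5 * y1 + 2.0 * y2 - 0.5 * y3) * mu ** 2
            + (-0.5 * y0 + 0.5 * y2) * mu
            + y1)


def phase_weights(control, phase):
    """Blend the control weight sets cyclically for a phase in [0, 2*pi)."""
    n = len(control)
    w = (phase % (2.0 * math.pi)) / (2.0 * math.pi) * n
    k1 = int(w) % n                 # control set at the start of the segment
    mu = w - int(w)                 # position within the segment, in [0, 1)
    k0, k2, k3 = (k1 - 1) % n, (k1 + 1) % n, (k1 + 2) % n
    return [catmull_rom(a, b, c, d, mu)
            for a, b, c, d in zip(control[k0], control[k1], control[k2], control[k3])]


# Four hypothetical control weight sets, each a flattened vector of three weights.
control = [
    [0.0, 1.0, -1.0],
    [1.0, 0.5, 0.0],
    [0.0, -1.0, 1.0],
    [-1.0, 0.5, 0.0],
]

for phase in (0.0, math.pi / 4.0, math.pi / 2.0):
    print(phase, phase_weights(control, phase))
```

Because the control sets are indexed modulo their count, the blended weights vary smoothly and wrap around when the phase completes a full cycle, which matches the cyclical nature of the walk described above.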
2.2.2 Mode-Adaptive Neural Networks for Quadruped Motion Control (2018)

This paper, authored by researchers at the University of Edinburgh, introduces a dual neural network system for generating locomotive character animation for quadrupeds [29]. The first of these two networks is the motion prediction network, which is highly similar to the neural network used in a phase-functioned neural network, see Section 3.2. The second is a gating network that outputs blending coefficients, which are used as inputs to the motion prediction network in a way similar to the phase variable of a phase-functioned neural network, see Section 3.2.

By controlling the weights of the motion prediction network using another generative neural network, one trades manually labeling the phase of the motion training data for the requirement of training this separate gating network. This, however, is necessary for quadrupeds, as their cyclical leg movement is heavily dependent on gait style and cannot be modelled with a single phase variable [29].

Result-wise, the mode-adaptive neural networks approach achieves more realistic quadruped motion on flat terrain than a phase-functioned neural network [30]. However, direct comparisons of memory footprint or computational complexity, for either flat or rough terrain, were omitted from the original paper.

2.2.3 Neural State Machine for Character-Scene Interactions (2019)

Compared to the other research presented here, this paper, also from the University of Edinburgh, focuses specifically on the interaction between a character and scene objects, such as opening doors, sitting on chairs, and lifting and carrying boxes [31].

For the training data, motion capture clips of the supported object interactions were recorded and a set of control points was manually labeled, such as the armrests of a chair being interacted with. A data augmentation scheme was then used to generate a 16 GB training set from the initial data.
At runtime, the state machine can transition and blend between animation modes such as walk, sit, open, carry, etc., triggered by user input. Additionally, this system is fed not only the pose of the character and specific control points along the trajectory of the character, but also the geometry of the nearby surroundings through voxelization.

In contrast to the neural state machine, which was built for this specific purpose, a phase-functioned neural network produces unsatisfactory and jittery results when attempting to produce character animation of scene object interactions [32]. However, for strictly locomotive tasks, a phase-functioned neural network produced comparable results in areas such as foot sliding and response time [31].

Additionally, the neural state machine presented in this research includes two conjoined neural networks, similar to the system structure presented in Section 2.2.2, one of which includes a phase variable similar to that of the phase-functioned neural network. Also, the comparatively massive training data set results in a considerably longer training time than that of a phase-functioned neural network [31].

2.2.4 Local Motion Phases for Learning Multi-Contact Character Movements (2020)

This paper, authored by researchers at Electronic Arts and the University of Edinburgh, presents the concept of local motion phases and shows successful applications of this concept both within new fields of animation and within the context of previous research from the university [33].

The local motion phase is conceptually similar to the phase variable in a phase-functioned neural network, see [5]. However, rather than modelling the locomotive movement of an entire character with a single phase variable, local motion phases are automatically calculated and maintained on a per-bone basis.
This allows for more realistic animations during highly detailed movements than those that can be generated by a phase-functioned neural network [34].

At its core, the system presented in this paper has a dual network structure similar to those mentioned in Section 2.2.2 and Section 2.2.3. However, in addition to the inclusion of the local motion phases, this paper introduces an autoencoder for the user input. This generative control model, pretrained on the motion capture data, encodes and decodes user input at runtime.

This approach has been shown to produce high-quality animations for tasks such as lifting boxes, similar to those presented in Section 2.2.3, playing basketball, and quadruped movement, similar to that presented in Section 2.2.2. However, as this approach builds upon multiple advanced concepts, its final network structure is substantially more intricate than that of the phase-functioned neural network.

2.2.5 Learned Motion Matching (2020)

Traditional motion matching [35] consists of a system that regularly fetches the most appropriate pre-processed animation from an animation database, given a set of pose features, for a specific character. Such a database consists of highly structured and non-overlapping directional movement, so as to minimize the number of recorded animation sequences. This is because the memory footprint of a motion matching system scales linearly with the size of the database, as the latter must be kept fully in memory during runtime.

Learned motion matching [36], however, is a technique presented by Ubisoft La Forge [1] that introduces a neural network approach to motion matching and removes the direct dependency on an animation database.

Performance-wise, although a learned motion matching system is able to produce results indistinguishable from those of a traditional motion matching system with a substantially smaller memory footprint, it does so at the cost of considerably longer computational time [36].
Compared to a phase-functioned neural network, a learned motion matching network uses slightly less memory and significantly less computational time at runtime, while requiring a far shorter training period [36] and being able to produce similar or better results [37].

Although a learned motion matching system does not require phase labeling, it instead needs to be trained on a meticulously constructed, encompassing database of motion capture data. Additionally, whereas a phase-functioned neural network consists of a single network, a learned motion matching system requires three distinct networks, each with its own features, targets, and error functions.

2.3 File Types & Software Libraries

This section aims to present relevant file types and software libraries used within this project, and to provide a brief introduction on how to use them.

2.3.1 The .bvh filetype

The .bvh, Biovision Hierarchy, filetype can be used to store motion capture data. These are human-readable text files, containing both a structural definition of the motion capture joint skeleton and the per-frame joint data.
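To make this two-part layout concrete, the following minimal Python sketch reads the channel order from the skeleton definition and uses it to label the values of each frame. The embedded file contents, the function name ‘parse_bvh’, and all variable names are illustrative assumptions and not part of the implementation used in this project; real .bvh files may contain details that this sketch ignores.

```python
# Minimal sketch: read the channel layout and the frames of a tiny .bvh file.
# The sample text and all helper names are illustrative assumptions.

SAMPLE_BVH = """HIERARCHY
ROOT Hips
{
    OFFSET 0.0 0.0 0.0
    CHANNELS 6 Xposition Yposition Zposition Zrotation Yrotation Xrotation
    JOINT LeftLeg
    {
        OFFSET 1.0 -1.0 1.0
        CHANNELS 3 Zrotation Yrotation Xrotation
        End Site
        {
            OFFSET 0.0 0.0 0.0
        }
    }
}
MOTION
Frames: 2
Frame Time: 0.008333
0.0 17.8 0.0 0.0 0.0 0.0 10.0 0.0 -5.0
0.1 17.9 0.0 1.0 0.0 0.0 12.0 0.0 -6.0
"""


def parse_bvh(text):
    """Return (channels, frame_time, frames) for a .bvh document.

    channels is a list of (joint_name, channel_name) pairs in file order,
    which is also the column order of every row in the MOTION section.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    motion_at = lines.index("MOTION")

    channels = []
    current_joint = None
    for line in lines[:motion_at]:
        tokens = line.split()
        if tokens[0] in ("ROOT", "JOINT"):
            current_joint = tokens[1]
        elif tokens[0] == "CHANNELS":
            count = int(tokens[1])
            channels += [(current_joint, name) for name in tokens[2:2 + count]]

    frame_count = int(lines[motion_at + 1].split(":")[1])
    frame_time = float(lines[motion_at + 2].split(":")[1])
    rows = lines[motion_at + 3:motion_at + 3 + frame_count]
    frames = [[float(value) for value in row.split()] for row in rows]
    return channels, frame_time, frames


channels, frame_time, frames = parse_bvh(SAMPLE_BVH)
# Pair each value of the first frame with the channel it belongs to.
first_frame = dict(zip(channels, frames[0]))
print(len(channels), frame_time, first_frame[("LeftLeg", "Zrotation")])
```

The sketch keeps the channels as a flat list because that list doubles as the column index of the motion rows; the parent-child structure and the offsets are deliberately not reconstructed here.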
    HIERARCHY
    ROOT Hips
    {
        OFFSET 0.000000 0.000000 0.000000
        CHANNELS 6 Xposition Yposition Zposition Zrotation Yrotation Xrotation
        JOINT LeftLeg
        {
            OFFSET 1.000000 -1.000000 1.000000
            CHANNELS 3 Zrotation Yrotation Xrotation
            End Site
            {
                OFFSET 0.000000 0.000000 0.000000
            }
        }
        JOINT RightLeg
        {
            OFFSET -1.000000 -1.000000 1.000000
            CHANNELS 3 Zrotation Yrotation Xrotation
            End Site
            {
                OFFSET 0.000000 0.000000 0.000000
            }
        }
    }
    MOTION
    Frames: 3
    Frame Time: 0.008333
    2.801100 17.851100 -0.421913 -0.943466 0.030603 6.685755 -1.889587 17.864721 4.969343 -14.847416 -7.065584 -13.249440 1.025640
    2.800815 17.848850 -0.421355 -0.932144 0.000126 6.685797 -1.904051 17.868369 4.931144 -14.870445 -7.095567 -13.284463 1.082333
    2.800560 17.846750 -0.420267 -0.919127 -0.037791 6.682931 -1.919381 17.879059 4.896097 -14.894115 -7.119117 -13.317921 1.137596

Figure 2.6: Simplified .bvh file example

Consider the simple .bvh example in Figure 2.6. Firstly, a ‘HIERARCHY’ of skeleton nodes is defined by name and parent offset. Additionally, each joint has a number of ‘CHANNELS’ associated with it. In this example, a skeleton of three joints is defined: a parent ‘Hips’ joint with two child joints, ‘LeftLeg’ and ‘RightLeg’.

Lastly, a .bvh file features a ‘MOTION’ section where the per-frame motion capture data is listed. In order, each floating point value here corresponds to one of the ‘CHANNELS’ specified in the previous section. Each new row of data corresponds to a new frame. Joint orientations are stored in degrees, within the interval (−180, 180].

Software such as Blender can import .bvh files and render the motion applied to the skeleton, as defined within the .bvh file.

2.3.2 Theano

Theano [38] is a Python library built on top of NumPy [39] for efficient multi-dimensional array computations on the GPU. Theano was initially released in 2007 and further development on the project was shut down in 2017 [40].
Theano is used by creating ‘theano.function’ objects that specify both the input/output parameters and the actual operation to perform. For a simple example of Theano code, see Figure 2.7. However, Theano can be used for much more complex computations, such as neural network training, by passing the error function and learning rate update to the ‘theano.function’ function.

    import theano
    from theano import tensor

    # Symbolic double-precision scalars and the expression to evaluate.
    a = tensor.dscalar()
    b = tensor.dscalar()
    c = a + b

    # Compile the expression into a callable function.
    f = theano.function([a, b], c)
    print(f(0.5, 1.5))

Figure 2.7: Simplified Theano code snippet for addition on the GPU

The code snippet is expected to print ‘2.0’ to the console after performing an addition of the scalars 0.5 and 1.5, on the GPU if Theano has been configured to use one.

2.3.3 Eigen

Eigen [41] is a C++98 library used to perform high-speed matrix and array operations. Eigen was first released in 2006 and has since been used in the creation of other software libraries, such as TensorFlow [41] [42].

In Eigen, one can define both statically-sized and dynamically-sized matrices and arrays. However, arithmetic operations that mix matrices and arrays are not allowed, requiring users to explicitly convert objects between the two types. For a simple example of how to use Eigen arrays to represent a single-layered neural network, see Figure 2.8.

    #include <Eigen/Dense>

    Eigen::ArrayXXf W0;  // weights, one row per output node
    Eigen::ArrayXf b0;   // biases
    Eigen::ArrayXf Y;    // network output

    void PerformNetworkPrediction(const Eigen::ArrayXf& X) {
        // Matrix product needs matrix types; the bias is added per element.
        Y = (W0.matrix() * X.matrix()).array() + b0;
    }

Figure 2.8: Single-layered network implementation using Eigen

Note: this example requires variable initialization before use of the ‘PerformNetworkPrediction()’ function.

3 Methodology

3.1 To Answer the Research Questions

After the neural network solution has been fully integrated into the Apex game engine, its viability for generating locomotive character animations will be quantitatively evaluated in the following two ways.
3.1.1 Researching Responsivity

Responsivity will be measured as the computational time required per frame, during runtime, by the network-related code. This will be tested for locomotive character animations around a static, obstacle-free track course, so as to ensure deterministic user input.

This track course will be defined using waypoints, see Section 4.1.9, that both act as positional checkpoints along the course and dictate what movement style the character is expected to produce while traversing the environment toward the waypoint. The track course will be defined using the following waypoints, see Table 3.1:

    Index  Pos (m)     Gait    Speed  StrafeDir.
    0      (-30, 40)   Walk    2.5    -
    1      (-50, 0)    Walk    2.5    (-0.5, -0.5)
    2      (-25, -25)  Jog     10.5   -
    3      (25, -40)   Jog     10.5   -
    4      (10, -10)   Crouch  2      -
    5      (25, 50)    Walk    2.5    (0, -1)

Table 3.1: Track course details
Pos = the position of the waypoint in meters in world space
Gait = gait type, see Holden et al. [5]
Speed = goal root speed
StrafeDir. = normalized character facing direction vector when strafing

The final responsivity result will consist of the mean and standard deviation of the frame time, calculated on a per-lap basis over 19 laps. The in-engine representation of the track course can be seen in Section 5.2.

Responsivity was chosen for evaluation because high responsiveness is a requirement for both immersive and interactive non-passive experiences, such as computer games, and because it can be evaluated quantitatively. Additionally, low responsiveness may influence both player enjoyment and performance when playing computer games, interrupting a possible state of flow [43].

3.1.2 Researching Accuracy

The accuracy of the integrated network solution will be measured by comparison between the predicted output pose and the corresponding ground truth for the entirety of the training data.
To allow for this comparison, pairs of input and output vectors will be generated in the same way as during the database generation process, see Section 3.3.2, and will be evaluated as part of the final runtime package, see Section 4.1.8.

There are multiple different error definitions used within regression, such as mean-squared error (MSE), root mean-square error (RMSE), and mean-absolute error (MAE) [44]. The error definition to be used as part of the accuracy evaluation in this report will be mean-absolute error. This was chosen because, firstly, mean-absolute error is less sensitive to outliers, which may be expected in this particular data set, and secondly, mean-squared error is already used as part of the training process. Using a different error calculation for the evaluation process than for training may be useful for testing generalization and allows for comparisons between the two errors.

The error will be presented as both mean and standard deviation on a per-file basis, evaluated through a per-frame calculation according to the following mean-absolute error formula:

$$ \frac{1}{|j|} \sum_{j} \frac{|t_j - p_j|}{t_j} \tag{3.1} $$

Where $|j|$ is the total number of frames in this file, $t_j$ is the three-dimensional position of joint $j$ as defined in the motion capture database, and $p_j$ is the three-dimensional network-predicted output position of joint $j$. The values taken from the motion capture database, including $t_j$, are referred to as the ground truth.

As this definition includes $t_j$ as the denominator, an error evaluated using the formula can be interpreted as the relative prediction error. In other words, an error of 0.01 equals an average joint position error of 1%.

Additionally, as a result of restrictive system memory, the maximum number of frames considered in each motion capture data file is that which equals at most 500’000 discrete joint positions. In other words, for a skeleton with 191 joints, only the first 2’617 unique frames will be considered.
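As a rough illustration of Equation 3.1 and of the frame cap described above, consider the following Python sketch. It rests on two assumptions that the text does not settle: the absolute difference of two three-dimensional positions is read as their Euclidean distance, and dividing by $t_j$ is read as dividing by the Euclidean norm of the ground-truth position. The function names and the sample positions are hypothetical.

```python
import math

# Assumption: "dividing by t_j" is read as dividing by the Euclidean norm of
# the ground-truth position, since t_j itself is a three-dimensional vector.
MAX_JOINT_POSITIONS = 500_000  # memory cap stated in the text


def max_frames(joint_count, cap=MAX_JOINT_POSITIONS):
    """Largest number of whole frames whose joint positions fit under the cap."""
    return cap // joint_count


def relative_mae(truth, predicted):
    """Mean relative position error over paired 3D positions."""
    total = 0.0
    for t, p in zip(truth, predicted):
        total += math.dist(t, p) / math.hypot(*t)
    return total / len(truth)


# Hypothetical ground-truth and predicted joint positions.
truth = [(0.0, 100.0, 0.0), (30.0, 40.0, 0.0)]
predicted = [(0.0, 101.0, 0.0), (30.0, 40.0, 0.5)]
print(relative_mae(truth, predicted))
print(max_frames(191))
```

Both sample joints are off by 1% of their distance from the origin, so the sketch yields a relative error of about 0.01, and the frame cap for a 191-joint skeleton reproduces the 2’617 frames stated above. A ground-truth position at the origin would cause a division by zero under this reading, which a real evaluation would need to guard against.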
3.1.3 Researching Architecture

To answer the subsidiary research question regarding architecture optimality, see Section 1.3, comparisons will be made between the results of the different network configurations. This analysis is to be done by comparison of the evaluation results, as presented in Section 3.1.1 and Section 3.1.2, between the following neural network configurations:

• Holden - Default (HOLDEN) - The default network solution as presented in Holden et al. [5]: a phase-functioned neural network with a single hidden network layer of 512 nodes, trained for 2’000 epochs, see Section 3.3.3.
• Holden - Extra Layer (HOLDEN-XL) - The default network solution as presented in Holden et al. [5] but using two hidden network layers of 512 nodes each.
• Holden - Extra Trained (HOLDEN-XT) - The default network solution as presented in Holden et al. [5] but trained for 4’000 epochs.
• Avalanche (AVA) - The default network solution as presented in Holden et al. [5] but heavily altered to accommodate an in-house skeleton.

The reasoning behind these specific choices of network configurations was that, firstly, there must be a control case network that mimics the original implementation. Then, a deeper network, or a network that is trained for a longer period of time, could be used for conceptually straightforward comparisons. Additionally, evaluating a network with a longer training process would also make it possible to investigate whether the original implementation by Holden et al. suffers from overfitting, see Section 2.1.4, given that that implementation uses no validation data or early stopping, see Section 2.1.4.

The Holden configurations

The hyperparameters that will be used for the network configurations are directly based on the work by Holden et al. [5], see Section 3.2.

Additionally, the 31-joint .bvh, see Section 2.3.1, skeleton used for these configurations is that of the original .bvh files made public by Holden et al.
[5], see Figure 3.1.

The HOLDEN and HOLDEN-XT networks will both have an input layer width of 342, a hidden layer width of 512, and an output layer width of 311. The extra hidden layer present in the HOLDEN-XL network configuration will also have 512 nodes.

Figure 3.1: Visualization of the .bvh skeleton used by the Holden configurations
This is the same skeleton as presented by Holden et al. [5].

The Avalanche configuration

The altered AVA network will generally use the same network hyperparameters as the Holden configurations. However, it will use a different skeleton, one that is used in a live Avalanche product. This skeleton is visualized in Figure 3.2.

Figure 3.2: Visualization of the .bvh skeleton used by the AVA configuration
The ‘extra’ joints that appear outside the character body are used for tasks such as deformation, object interactions, and player camera locations. Note the many additional joints in the character head and hands, compared to Figure 3.1. For the complete list of joint names in this skeleton, see Appendix A.3.

To accommodate the AVA skeleton having 191 joints, rather than the 31 of the Holden configurations, the Avalanche network will have an input layer width of 1’302, the same hidden layer width of 512, and an output layer width of 1’751.

Also, since this configuration aims to use an in-house Avalanche skeleton, the .bvh files made public by Holden et al. [5] will need to be re-generated. This retargeting step will be done by professionals employed at Avalanche.

Additionally, since the network is trained for 60Hz predictions, whereas the retargeted .bvh files were retargeted to the in-house standard of 30Hz, these .bvh files will need to be interpolated, see Figure A.1 and Figure A.2 in Appendix A.1.

As a consequence of the greater number of joints, the network training database used for the AVA configuration will only include every fourth motion capture frame.
This is because the training database would otherwise not fit in the memory of the system used for this thesis, see Section 1.4. To reemphasize: the AVA configuration will be trained on a fourth of the number of motion capture frames used for the Holden configurations. However, each frame in the AVA training database will contain data for more than six times as many joints as in the Holden training database.

This issue could have been resolved by rewriting the network training logic such that the network could load and unload parts of the training database. This would allow the network to indirectly train on the entire data set, including all movement frames, even though the database would be too large to fit in system memory at once. However, this procedure would require an extensive rewrite of the original implementation by Holden et al., and it could drastically increase the time required for the training process, as loading and unloading such large chunks of memory is slow.

Also, the specific subset of the joints in the Avalanche skeleton most equivalent to those of the Holden skeleton will be referred to as the Avalanche Masked skeleton: AVA-M. In other words, the AVA-M skeleton is the subset of Avalanche joints most similar to those in the Holden skeleton, see Section 3.3.3.

3.1.4 Simple Difference Significance Evaluation

To evaluate the statistical significance of the difference between two data sets, a and b, the following heuristic will be used:

$$ \frac{2\,|\mathrm{mean}(a) - \mathrm{mean}(b)|}{\mathrm{sd}(a) + \mathrm{sd}(b)} \tag{3.2} $$

This is equivalent to expressing the difference between the means of the two data sets in units of the average of their respective standard deviations. The absolute difference is used here for the same reasons that mean-absolute error is used for the accuracy evaluation, see Section 3.1.2.
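The heuristic in Equation 3.2 can be sketched in a few lines of Python. The sample values below are hypothetical and not measured data, and the use of the sample standard deviation (‘statistics.stdev’) rather than the population standard deviation is an assumption, as the text does not specify which is meant.

```python
from statistics import mean, stdev


def difference_score(a, b):
    """Absolute difference of means, in units of the average standard deviation.

    Assumption: sd() is taken to be the sample standard deviation.
    """
    return 2 * abs(mean(a) - mean(b)) / (stdev(a) + stdev(b))


# Hypothetical per-lap frame times in milliseconds, not measured data.
laps_a = [0.36, 0.38, 0.37, 0.39]
laps_b = [0.50, 0.49, 0.51, 0.50]
print(difference_score(laps_a, laps_b))
```

A score well above 1 indicates that the two means lie several average standard deviations apart, whereas a score near 0 indicates that the difference is small compared to the spread within each data set.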
3.2 The Phase-Functioned Neural Network

A phase-functioned neural network, as presented by Holden et al. [5], is a neural network with weights generated by a cyclic phase variable produced by a phase function.

This section aims to describe the functional components of a phase-functioned neural network within the specific context of this project, as presented in Holden et al. [45].

3.2.1 Network Structure

The network architecture used in Holden et al. [5] is a neural network with the following structure, where each network node uses a trainable bias:

• H0 - Input layer of 342 nodes, see Section 3.2.2.
• H1 - Fully-connected hidden layer of 512 nodes.
• H2 - Output layer of 311 nodes, ELU [8] activation function, see Section 3.2.3.

3.2.2 The Input Vector

The input vector $x_i$, at frame $i$, is a concatenation of, amongst others, sample points on the terrain along the traversed and expected path of the animated character, see Figure 3.3, and the current joint positions and velocities of the character.

$$ x_i = \{ t^p_i, t^d_i, t^h_i, t^g_i, j^p_{i-1}, j^v_{i-1} \} \in \mathbb{R}^n \tag{3.3} $$

where:

• $t^p_i \in \mathbb{R}^{2t}$, the x, y positions of the sample points in character local space
• $t^d_i \in \mathbb{R}^{2t}$, the x, y trajectories of the sample points in character local space
• $t^h_i \in \mathbb{R}^{3t}$, the heights of each sample point and additional sub-sample points
• $t^g_i \in \mathbb{R}^{5t}$, a vector containing the gait of the character along the sample points
• $j^p_{i-1} \in \mathbb{R}^{3j}$, the positions of all j character joints in the previous frame $i - 1$
• $j^v_{i-1} \in \mathbb{R}^{3j}$, the velocities of all j character joints in the previous frame $i - 1$

where:

t is the number of sample points centered around the character, including the one at the feet of the character. This value was set to 12 in Holden et al. [5], equaling five sample points ahead of, and six sample points behind, the character.

j is the number of joints within the character model. This value was set to 31 in Holden et al. [5].
Figure 3.3: Subset of PFNN input vector visualized
a: sample point positions - $t^p_i \in \mathbb{R}^{2t}$
b: sample point trajectories - $t^d_i \in \mathbb{R}^{2t}$
c: (sub-)sample point heights - $t^h_i \in \mathbb{R}^{3t}$
source: Holden et al. [46].

3.2.3 The Output Vector

Similarly, the output vector $y_i$, at frame $i$, is a concatenation of predicted future states, the next pose of the character, and an update of certain metadata.

$$ y_i = \{ t^p_{i+1}, t^d_{i+1}, j^p_i, j^v_i, j^a_i, r^x_i, r^z_i, r^a_i, \dot{p}_i, c_i \} \in \mathbb{R}^m \tag{3.4} $$

where:

• $t^p_{i+1} \in \mathbb{R}^{2t}$, the predicted x, y positions of the sample points in character local space of the next frame $i + 1$
• $t^d_{i+1} \in \mathbb{R}^{2t}$, the predicted x, y trajectories of the sample points in character local space of the next frame $i + 1$
• $j^p_i \in \mathbb{R}^{3j}$, the generated positions of all j character joints
• $j^v_i \in \mathbb{R}^{3j}$, the generated velocities of all j character joints
• $j^a_i \in \mathbb{R}^{3j}$, the generated angles of all j character joints
• $r^x_i \in \mathbb{R}$, local character velocity in the relative x direction
• $r^z_i \in \mathbb{R}$, local character velocity in the relative z direction
• $r^a_i \in \mathbb{R}$, local character angular velocity around the world up vector
• $\dot{p}_i \in \mathbb{R}$, phase variable update delta
• $c_i \in \mathbb{R}^4$, binary contact labels of heel and toe joints with the ground

3.2.4 The Phase Function

The phase function blends between four sets of network weights, $\alpha_{k_0}, \alpha_{k_1}, \alpha_{k_2}, \alpha_{k_3}$, using cubic Catmull-Rom interpolation [47]. As such, the number of network weights that need to be stored in memory at runtime is four times that of a single network configuration. The phase function $\Theta$ is evaluated as:

$$ \Theta(p; \alpha_{k_0}, \alpha_{k_1}, \alpha_{k_2}, \alpha_{k_3}) = \alpha_{k_1} + w\left(\tfrac{1}{2}\alpha_{k_2} - \tfrac{1}{2}\alpha_{k_0}\right) + w^2\left(\alpha_{k_0} - \tfrac{5}{2}\alpha_{k_1} + 2\alpha_{k_2} - \tfrac{1}{2}\alpha_{k_3}\right) + w^3\left(\tfrac{3}{2}\alpha_{k_1} - \tfrac{3}{2}\alpha_{k_2} + \tfrac{1}{2}\alpha_{k_3} - \tfrac{1}{2}\alpha_{k_0}\right) \tag{3.5} $$

where:

$$ w = \frac{4p}{2\pi} \pmod{1} \tag{3.6} $$

$$ k_n = \left\lfloor \frac{4p}{2\pi} \right\rfloor + n - 1 \pmod{4} \tag{3.7} $$

Within this project, the phase function will be evaluated during runtime. An alternative approach would be to precompute the function and store its results in memory.
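As a rough illustration of Equations 3.5 to 3.7, the following Python sketch blends four small weight sets for a given phase. The function name ‘phase_blend’ and the representation of each weight set as a flat list of floats are assumptions made for the example; the actual network weights are matrices and biases.

```python
import math


def phase_blend(p, alphas):
    """Blend four weight sets with the cubic Catmull-Rom spline of Eq. 3.5.

    p is the phase in [0, 2*pi); alphas holds four equally sized lists of
    floats standing in for the four sets of network weights.
    """
    scaled = 4.0 * p / (2.0 * math.pi)
    w = scaled % 1.0                                       # Eq. 3.6
    k = [(math.floor(scaled) + n - 1) % 4 for n in range(4)]  # Eq. 3.7
    a0, a1, a2, a3 = (alphas[index] for index in k)
    return [
        y1
        + w * (0.5 * y2 - 0.5 * y0)
        + w ** 2 * (y0 - 2.5 * y1 + 2.0 * y2 - 0.5 * y3)
        + w ** 3 * (1.5 * y1 - 1.5 * y2 + 0.5 * y3 - 0.5 * y0)
        for y0, y1, y2, y3 in zip(a0, a1, a2, a3)
    ]


# Four hypothetical one-element weight sets.
weights = [[0.0], [1.0], [2.0], [3.0]]
print(phase_blend(0.0, weights))
print(phase_blend(math.pi / 2.0, weights))
```

At the phases 0 and π/2 the blend reduces to the first and second weight set respectively, since w is zero there and only the $\alpha_{k_1}$ term remains. Precomputing, the alternative mentioned above, would correspond to calling such a function once for a fixed set of phase values and caching the blended weights.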
This would reduce the computational load at runtime but increase the memory footprint [5].

3.3 The Full PFNN Pipeline

Figure 3.4: The full PFNN pipeline
‘data’ = offline storage
‘script’ = runnable files
‘memory’ = temporary, runtime

This section aims to provide an overview of the full phase-functioned neural network pipeline, as designed by Holden et al. [5] and as presented in Figure 3.4. The final integrated version of this model is presented in Section 4.1.

3.3.1 Generate Patches

To allow for the generation of locomotive character animations that adhere to the roughness of the topography, the training data used later must include different types of terrain. A solution to this is to fit heightmaps to the separately recorded motion capture data, firstly producing intermediate patches of terrain.

3.3.2 Generate Database

During this step, each input and output vector pair, see Section 3.2.2 and Section 3.2.3, is produced and stored. Each vector pair is created on a per-frame basis using motion capture data, see Section 2.3.1, and associated labels, such as the phase and gait variables. Additionally, for each motion capture clip, the ten most suitable heightmaps are fitted to the foot-to-ground contacts of the character.

3.3.3 Network Training

Training will be performed using the Theano [38] implementation by Holden et al. [5] and an Adam optimizer, see Section 2.1.6. Theano is a Python library for multi-dimensional array computations on the GPU, see Section 2.3.2. The result of this step will be the finalized trained network weights. The default hyperparameters for the training will be:

• batchsize = 32
• learning rate = 0.0001
• beta1 = 0.9
• beta2 = 0.999
• epochs = 2000
• error function = mean-squared error

For the order of the motion capture data files, see Table A.1 in Appendix A.2.
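To show how the learning rate, beta1, and beta2 listed above enter the update rule, the following Python sketch performs Adam steps on a single scalar weight. It follows the commonly published form of the Adam update and is not taken from the training code of Holden et al.; the epsilon value, the toy error function, and all names are assumptions.

```python
import math

LEARNING_RATE = 0.0001
BETA1 = 0.9
BETA2 = 0.999
EPSILON = 1e-8  # assumption: common default, not stated in the text


def adam_step(weight, grad, m, v, step):
    """One Adam update for a single scalar weight; step counts from 1."""
    m = BETA1 * m + (1.0 - BETA1) * grad           # first moment estimate
    v = BETA2 * v + (1.0 - BETA2) * grad * grad    # second moment estimate
    m_hat = m / (1.0 - BETA1 ** step)              # bias correction
    v_hat = v / (1.0 - BETA2 ** step)
    weight -= LEARNING_RATE * m_hat / (math.sqrt(v_hat) + EPSILON)
    return weight, m, v


# Toy problem: minimise the squared error (w - 3)^2, whose gradient is 2(w - 3).
w, m, v = 0.0, 0.0, 0.0
for step in range(1, 2001):
    w, m, v = adam_step(w, 2.0 * (w - 3.0), m, v, step)
print(w)
```

In this toy run each step moves the weight by roughly the learning rate, regardless of the gradient magnitude, which gives some intuition for why a learning rate of 0.0001 is paired with a large number of training iterations.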
During the training process, the translation and orientation within the input vector of the joints not on the following list, or not equivalent to these in the case of the Avalanche configuration, will be set to approximately zero, as is done in Holden et al. [5]:

• Hips
• LeftUpLeg
• LeftLeg
• LeftFoot
• LeftToeBase
• RightUpLeg
• RightLeg
• RightFoot
• RightToeBase
• Spine
• Spine1
• Neck1
• Head
• LeftArm
• LeftForeArm
• LeftHand
• RightArm
• RightForeArm
• RightHand

Additionally, this training process will not make use of the early-stopping technique, see Figure 2.4.

3.3.4 Neural Network

This step includes the entire package necessary for runtime pose prediction. During initialization, all necessary trained network weights will be read and loaded into memory. Then, each frame, a prediction request is passed to the package, providing the character pose in the current frame and expecting an updated character pose as the return value. In addition to the character pose, other metadata is fed to the network for prediction, such as sample points of the topography and user input, see Section 3.2.2.

This step is where the bulk of the integration work will lie. However, the overall package structure will be based on the demonstration codebase made public by Holden et al. [5], with the neural network model defined in Eigen, see Section 2.3.3, arrays and matrices.

Additionally, as mentioned in Section 1.4, the motion capture data, and therefore the trained neural network, use a left-handed world-axis, whereas the Apex engine uses a right-handed world-axis, see Figure 3.5.

Figure 3.5: Visual representation of left/right-handed world-axis orientations
Left: Left-handed world-axis (green)
Right: Right-handed world-axis (purple)

For this reason, the runtime neural network package must be altered such that it can convert between the world-axis orientations.
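For joint positions, such a conversion can be as simple as mirroring one coordinate axis, as the following Python sketch illustrates. Which axis differs between the motion capture data and the engine is an assumption here (z by default), the function names are hypothetical, and joint orientations would need a corresponding treatment that is not shown.

```python
def flip_handedness(position, axis=2):
    """Mirror a 3D position across the plane perpendicular to one axis.

    Negating exactly one coordinate switches between left- and right-handed
    systems. Which axis to negate is an assumption (z by default).
    """
    flipped = list(position)
    flipped[axis] = -flipped[axis]
    return tuple(flipped)


def to_network_space(pose):
    """Convert every joint position of a pose into the other handedness."""
    return [flip_handedness(joint) for joint in pose]


# The conversion is its own inverse, so the same call maps network output back.
to_engine_space = to_network_space

# Hypothetical two-joint pose.
pose = [(1.0, 2.0, 3.0), (0.0, 1.5, -0.5)]
print(to_network_space(pose))
print(to_engine_space(to_network_space(pose)) == pose)
```

Because mirroring is its own inverse, one function can serve both directions of the conversion, which keeps the two conversion points in the runtime package symmetric.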
The character pose, living in a right-handed world-axis, is to be converted to the left-handed world-axis of the neural network. Then, the updated character pose output by the neural network must be converted back into the right-handed world-axis before being applied to the character skeleton.

4 Process

4.1 The Runtime Package

This section aims to present the runtime package implemented for the phase-functioned neural network solution, originally based on the demonstration software made public by Holden et al. [5]. An overview of this package is presented in Figure 4.1.

Figure 4.1: The Procedural Animations runtime package
The neural network solution is accessible either from the ProceduralAnimations class, or indirectly through the ErrorCalculator class.

4.1.1 Using the Runtime Package

The runtime package, see Section 4.1, is designed to have a low level of coupling, such that other programmers need not interact with, nor understand, the inner workings of the package. As such, to use the runtime package, a programmer only needs to do two things: initialize the ProceduralAnimations class, and call ‘GenerateNextPose(...)’ when wanting to use the network for predictions, see Section 4.1.2.

During initialization, the ProceduralAnimations constructor takes three optional parameters, see Figure 4.2:

• new_world_transform - A 4×4 matrix for character scaling/rotation/translation.
• new_setting - A Setting enum, see Section 4.1.6, for network configurations.
• new_waypoints_ptr - A pointer to a vector of Waypoint:s, see Section 4.1.9.

    CProceduralAnimations(
        AosMatrix4 new_world_transform = AosMatrix4(0.0f),
        CSettings::SETTING new_setting = CSettings::HOLDEN,
        std::vector* new_waypoints_ptr = nullptr);

Figure 4.2: Procedural Animations constructor

Then, during the constructor execution, the objects that the ProceduralAnimations class owns are initialized.
During runtime prediction, only the ‘GenerateNextPose(...)’ function is required, see Figure 4.3. This function takes two parameters: a pointer to the current character pose, and a pointer to the translational character-in-world offset. Then, ‘GenerateNextPose(...)’ updates the two input parameters in place, given the outputs of the neural network.

    void GenerateNextPose(CPose* pose, AosVector3* translation_offset);

Figure 4.3: Procedural Animations per frame prediction

4.1.2 ProceduralAnimations

This class owns the pointers to the PFNN, Character, Trajectory, and Settings representations. As the ErrorCalculator class is intended only for evaluation purposes, the ProceduralAnimations class is the default way to access the phase-functioned neural network solution.

Inside the ‘GenerateNextPose(...)’ function, see Figure 4.3, the flow of sub-function calls is organized as follows:

1. Prepare - Stores the input pose information in the Character object.
2. Input - Evaluates the Waypoint information and sets the Trajectory state.
3. Insert - Inserts the Character and Trajectory states into the input vector.
4. Predict - Runs the network prediction, setting the output vector.
5. Output - Stores the relevant output vector information in the return pose.
6. Update - Updates the Character and Trajectory states using the output vector.

The time required to perform these six steps is recorded each frame for use in evaluating the system's responsivity, see Section 3.1.1.

Additionally, this class has debug rendering functionality for visually rendering network parameters, such as the joint skeleton, the sample points, character velocities, etc., in the engine.

4.1.3 PFNN

This struct holds the memory representation of the neural network and is responsible for the network prediction. When initialized, the PFNN struct loads the network weights and biases into Eigen, see Section 2.3.3, matrices in memory from stored .bin files.
The .bin directory and network configuration are fetched from the Settings object. Additionally, the PFNN struct is the only part of the runtime package dependent on the Eigen library.

During the prediction step, the PFNN struct performs the matrix multiplications necessary to propagate the input vector state, and then standardizes the result before storing it in the output vector data structure.

4.1.4 Character

The Character struct stores the positions and translations, in model space, of all character joints in the current frame. Additionally, the same information is stored for a few previous frames to allow for output blending when setting the return pose values.

4.1.5 Trajectory

Similar to the Character struct, the Trajectory struct holds all information regarding the sample points along the ground, see Figure 3.3, such as positions and velocities. These values are also stored across multiple frames to allow for output blending.

4.1.6 Settings

The Settings class is used to allow easy switching between the different network configurations, see Section 3.1.3, which are represented as an enum passed to the constructor. To allow for a low level of coupling and for extensibility, in the form of being able to add additional network configurations with minimal changes to the code base, the Settings class holds all data that may be affected by the choice of network configuration. In other words, if one wants to add another network configuration, one would only need to add support for it in the Settings class.

For example, all paths to the network .bin files are defined in the Settings class. This means that when a PFNN object initializes, it simply calls something similar to ‘settings->GetWeightsPath()’, without needing any logic, e.g. switch cases, that requires knowledge of the network configuration enum or of how that configuration would affect this class. This is shown in Figure 4.4.
Process class Settings{ enum CONFIG {HOLDEN, AVA}; string path; Settings(CONFIG new_config){ switch(new_config){ case HOLDEN: path = "/holden_weights/" break; case AVA: path = "/avalanche_weights/" break; } } string GetPath(){ return path; } } Figure 4.4: Simplified example of Settings implementation (DISCLAIMER: PSEUDO CODE! NOT ACTUAL IMPLEMENTATION!) 4.1.7 HelperFunctions This is a simple, fully static class that holds functions such as debug outprints and definitions for specific matrix operations. 4.1.8 ErrorCalculator When evaluating the network, rather than creating an instance of the Procedu- ralAnimations class, one initializes an ErrorCalculator instead. This object acts as a wrapper around a ProceduralAnimations instance and, rather than depending on an input pose, uses stored input and output vector pairs, see Section 3.2.2 and Section 3.2.3. This class is therefore responsible for calculating the evaluative results required in the answering of the research question regarding accuracy, see Section 1.3 and Section 3.1.2. This evaluation process can either be run immediately on initialization, or on a per frame basis to allow for visualization of the network prediction, compared to the ground truth. This is controlled with a ‘run-offline’ flag. Since the ErrorCalculator constructs an internal ProceduralAnimations instance, it also requires the same input parameters; both in the constructor, see Figure 4.5, and on the per frame prediction, see Figure 4.6. CProceduralAnimations( AosMatrix4 new_world_transform = AosMatrix4(0.0f), CSettings::SETTING new_setting = CSettings::HOLDEN, std::vector* new_waypoints_ptr = nullptr, bool run_offline = false); Figure 4.5: Error Calculator constructor 32 4. 
Process float CalculateError(CPose* pose); Figure 4.6: Error Calculator per frame prediction 4.1.9 Waypoint Each Waypoint instance is a simple datastructure, representing one checkpoint along the obstacle course that the characters will traverse as part of the responsivity research, see Table 3.1 in Section 3.1.1. In addition to its inherent world translation, each Waypoint holds information rep- resenting the goal movement style that a character aims to perform when reaching it. This includes the gait styles; walking, jogging, crouching, etc, but also movement speed and facing direction. This, in an aim to deterministically simulate user input during the evaluation process. The ProceduralAnimations instance keeps track of the current Waypoint index, and increments that number upon reaching the next checkpoint. 33 4. Process 34 5 Result 5.1 Responsivity Results All responsivity data, which is used to produce the figures and tables presented in this section, is available in Appendix A.4. For more information regarding the track course used, see Section 3.1.1. The responsivity results presented in Figure 5.1 shows the average frame time com- putation in milliseconds per lap around the course. The same data is presented as a box plot in Figure 5.2, and summarized in Table 5.1. Figure 5.1: Line chart of responsivity results 35 5. Result Figure 5.2: Boxplot of responsivity results HOLDEN HOLDEN-XL HOLDEN-XT AVA Mean 0.376 0.499 0.382 1.55 SD 0.0173 0.0161 0.0167 0.0430 Table 5.1: Mean and standard deviation results of responsivity evaluation Values are rounded to three significant digits. By combining the visual results of the line chart in Figure 5.1 and the box plot in Figure 5.2, one can conclude that there is a considerably sized difference in compu- tational time required for that of the AVA network configuration. A potential root cause of this is the great increase in number of joints for that network, see Section 6.1.1 for further discussion on this topic. 
For the Holden configurations, the results of HOLDEN and HOLDEN-XT have almost perfect overlap in both Figure 5.1 and Figure 5.2. As such, one can conclude that these two network configurations have practically equivalent responsivity. This is not too surprising as, in theory, a network having trained longer, with otherwise the same hyperparameters, should only result in a different set of network weights. Two otherwise equivalent networks with different weights should still be evaluated at runtime at the same speed.

Finally, for the HOLDEN-XL configuration, it is not as visually clear whether it evaluates at a considerably different speed at runtime than the other Holden configurations. For this, the similarity metric defined in Section 3.1.4 can be used. This metric evaluates the absolute difference between the two mean results, standardized by the average standard deviation of the two data series. In other words, the metric evaluates how many standard deviations apart two means are.

\[ \frac{2\,|\mathrm{mean}(a) - \mathrm{mean}(b)|}{\mathrm{sd}(a) + \mathrm{sd}(b)} \tag{5.1} \]

• HOLDEN to HOLDEN-XT: 2|0.382 − 0.376| / (0.0173 + 0.0167) ≈ 0.35
• HOLDEN-XL to HOLDEN-XT: 2|0.499 − 0.382| / (0.0161 + 0.0167) ≈ 7.1
• AVA to HOLDEN-XL: 2|1.55 − 0.499| / (0.0430 + 0.0161) ≈ 36

These calculations, together with the visualizations in Figure 5.1 and Figure 5.2, suggest the relative significance of the differences between the responsivity results, measured in standard deviations. Even though the number of standard deviations between the HOLDEN-XL and HOLDEN-XT results is much smaller than the distance to the AVA results, one can still argue that there is a noticeable dissimilarity in the computational time required for the HOLDEN-XL configuration.
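The metric in Equation 5.1 is simple enough to sketch in a few lines of C++. The struct and function names below (`SeriesSummary`, `SimilarityMetric`) are illustrative assumptions and not part of the thesis code base; the sketch only mirrors the formula as stated above.

```cpp
#include <cassert>
#include <cmath>

// Mean and standard deviation of one data series
// (for example, one column of Table 5.1). Hypothetical helper type.
struct SeriesSummary {
    double mean;
    double sd;
};

// Equation 5.1: the absolute difference between two means, divided by the
// average of the two standard deviations:
//   2 * |mean(a) - mean(b)| / (sd(a) + sd(b))
double SimilarityMetric(const SeriesSummary& a, const SeriesSummary& b) {
    const double sd_sum = a.sd + b.sd;
    assert(sd_sum > 0.0);  // undefined when both series have zero spread
    return 2.0 * std::fabs(a.mean - b.mean) / sd_sum;
}
```

Called with the HOLDEN (0.376, 0.0173) and HOLDEN-XT (0.382, 0.0167) summaries from Table 5.1, the function returns roughly 0.35; with HOLDEN-XL (0.499, 0.0161) and HOLDEN-XT it returns roughly 7.1. The metric is symmetric in its two arguments.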
The difference in computational time for the HOLDEN-XL configuration could be explained by the fact that adding another layer to a neural network strictly increases the number of computations, and therefore the time, required for evaluation at runtime. For further discussion on this topic, see Section 6.1.1.

5.2 Responsivity Visualizations

Video recordings of these visualizations are available here [48].

Figure 5.3: Some frames from the HOLDEN responsivity evaluation. Top left: jogging. Top right: crouching. Bottom left: strafing backwards. Bottom right: walking. White globes are Waypoints, see Section 4.1.9. The golden Waypoint is the next positional target of the network.

Figure 5.4: Some frames from the AVA responsivity evaluation. Top left: jogging. Top right: crouching. Bottom left: strafing backwards. Bottom right: walking. White globes are Waypoints, see Section 4.1.9. The golden Waypoint is the next positional target of the network.

In Figure 5.3 and Figure 5.4, one can see examples of the skeleton joint position outputs that the HOLDEN and AVA network configurations produced during their respective responsivity evaluations.

For the AVA configuration, certain errors occurred, potentially as a result of the network not being trained on a sufficient number of data frames, see Section 6.1.3 and Section 6.4.1 for further discussion. For example, notice how poorly the produced joint skeleton appears to be crouching in the bottom right frame in Figure 5.4. Additionally, the AVA configuration failed to adapt to tight turns, making the outputted skeleton overshoot the target, see Figure 5.5.

Figure 5.5: Directional overshoot during the AVA responsivity evaluation. The goal of the network is to move the skeletal character towards the golden Waypoint. However, the AVA configuration fails to sufficiently turn the character towards this goal before the character has passed it.

5.3 Accuracy Results

All accuracy data, which is used to produce the figures and tables presented in this section, is available in Appendix A.5 and Appendix A.6.

The accuracy results presented in Figure 5.6 show the average error per motion capture data file for each of the four network configurations. The error calculation is defined as presented in Section 3.1.2:

\[ \frac{1}{|j|} \sum_{j} \frac{|t_j - p_j|}{t_j} \tag{5.2} \]

Where |j| is the total number of frames in this file, t_j is the three-dimensional position of joint j as defined in the motion capture database, and p_j is the three-dimensional network predicted output position of joint j. The values taken from the motion capture database, including t_j, are referred to as the ground truth.

Additionally, the fifth data series ‘AVA-M’ shown in this figure represents the results of the AVA network limited to the subset of network outputs equivalent to those joints present in the original motion capture data made public by Holden et al. [45], see Figure 3.1 and Section 3.1.3.

Similarly, Figure 5.7 presents the standard deviations, a measurement of spread in the data distribution, of the per motion capture file network outputs for each of the network configurations. A smaller standard deviation equates to little difference between data points within a data series, whereas a higher standard deviation equates to more fluctuating data points.

The same accuracy data is presented as a box plot in Figure 5.8, and summarized in Table 5.2.

Figure 5.6: Line chart of mean results of accuracy evaluation. The error is defined as mean absolute error compared to the training data. For full definition of error, see Section 3.1.2. For indexing of motion capture files, see Appendix A.2.

What can be seen in Figure 5.6 is that the results of the three Holden configurations are greatly overlapping throughout the training data set. The blue diamond HOLDEN data series is almost perfectly obscured by the green triangle HOLDEN-XT series.
Somewhat similarly, the two Ava results appear to follow a slightly similar curvature, however vertically translated to a lower error level than that of the Holden configurations. Internally, however, the curvature of the two Ava data series is highly similar, though also vertically translated. In other words, if the AVA-M data series were shifted downwards in the chart, there would be almost constant visual overlap between it and the AVA data series. However, visually there is almost no similarity in the curvatures of the Holden and Ava configurations.

Throughout the entirety of Figure 5.6, the Ava data series produce a lower error than the Holden configurations. This is visualized through how the AVA and AVA-M data series are consistently below the other three.

The motion capture files that all network configurations performed the worst at, the data files indexed 72-75, were the files containing movement interacting with more advanced terrain and environments, such as balancing on elevated narrow beams and dynamic crouching beneath low ceilings.

Figure 5.7: Line chart of standard deviation results of accuracy evaluation. The error is defined as mean absolute error compared to the training data. For full definition of error, see Section 3.1.2. For indexing of motion capture files, see Appendix A.2.

Following a similar trend to the means presented in Figure 5.6, the standard deviations shown in Figure 5.7 have almost perfect overlap for the three Holden configurations. Once again, the blue diamond HOLDEN data series is almost perfectly obscured by the green triangle HOLDEN-XT series.

However, the curvature of the AVA-M data series appears to be a midpoint between that of the Holden configurations and that of the AVA data series. Visually, the AVA-M series has local maxima and minima similar to both of the aforementioned series.
Additionally, the values of the AVA-M series are positionally closer to those of the Holden configurations than to those of the AVA data series. In other words, data points along the turquoise circle AVA-M line are further away from those of the purple crossed AVA line than from those of the other three data series.

Figure 5.8: Boxplot of accuracy results. The error is defined as mean absolute error compared to the training data. For full definition of error, see Section 3.1.2.

As a reminder: an error of 0.01 equates to an average prediction error of 1%, see Section 3.1.2. As a concrete example, the predicted three-dimensional joint positions that the HOLDEN network configuration produced had, on average, a translational error of ≈ 4.4%.

            HOLDEN     HOLDEN-XL    HOLDEN-XT    AVA        AVA-M
    Mean    0.0439     0.0448       0.0446       0.0148     0.0258
    SD      0.00954    0.00980      0.00981      0.00283    0.00483

Table 5.2: Mean and standard deviation results of accuracy evaluation. Values are rounded to three significant digits.

For all three Holden configurations, the results have almost perfect overlap in both Figure 5.6 and Figure 5.7. As such, one can conclude that these three network configurations have practically equivalent accuracy. This is rather interesting as the HOLDEN-XL and HOLDEN-XT configurations each have a specific advantage, in the form of extra network depth and extra training time respectively, compared to the default HOLDEN configuration. These results suggest that there is no benefit to these specific network design alterations.

For the AVA configuration, by combining the visual results of the line chart in Figure 5.6 and the box plot in Figure 5.8, one can conclude that there is a significantly lower error in the AVA prediction than in that of the Holden configurations.

Lastly, the AVA-M data series appears to share some similarity with both the Holden and AVA data series.
To measure this similarity, one may utilize the difference metric used in Section 5.1 and originally presented in Section 3.1.1, calculating the number of standard deviations between the means. This produces the following results:

• HOLDEN to HOLDEN-XT: 2|0.0439 − 0.0446| / (0.00954 + 0.00981) ≈ 0.072
• HOLDEN to HOLDEN-XL: 2|0.0439 − 0.0448| / (0.00954 + 0.00980) ≈ 0.093
• HOLDEN-XT to HOLDEN-XL: 2|0.0446 − 0.0448| / (0.00981 + 0.00980) ≈ 0.020
• AVA-M to HOLDEN: 2|0.0258 − 0.0439| / (0.00483 + 0.00954) ≈ 2.5
• AVA to AVA-M: 2|0.0148 − 0.0258| / (0.00283 + 0.00483) ≈ 2.9

These differences can be summarized as the three Holden network configurations producing practically equivalent results, with all pairwise values below 0.1, and the AVA-M results being slightly closer to those of the Holden configurations than to those of the AVA configuration.

5.4 Accuracy Visualizations

Video recordings of these visualizations are available here [48].

Figure 5.9: Some frames from the HOLDEN accuracy evaluation. Gray: Ground truth joint positions. Blue: HOLDEN joint positions.

Figure 5.10: Some frames from the AVA accuracy evaluation. Gray: Ground truth joint positions. Magenta: AVA joint positions.

In Figure 5.9 and Figure 5.10, one can see examples of the skeleton joint position outputs that the HOLDEN and AVA network configurations produced during their respective accuracy evaluations.

5.5 Training Process

During the training process, the prediction mean-squared error of the full output vector was recorded after each epoch. Do note the difference in error definition compared to the one used in Section 5.3, and the fact that the entire output vector is used rather than just the predicted skeleton joint positions. This data is available in full in Appendix A.7, and presented as a line chart in Figure 5.11.
Figure 5.11: Mean-squared error of entire output vector during training. HOLDEN-XT is hidden during the first half of its training process as its design, and therefore its results, is entirely equivalent to that of the HOLDEN configuration.

Figure 5.11 shows similarity through overlap between the HOLDEN and HOLDEN-XL configurations throughout their training period. Additionally, during the extra training period of the HOLDEN data series, here equivalent to that of the HOLDEN-XT configuration, the mean-squared error remains relatively unchanged. This further emphasizes the similarity in accuracy argued in Section 5.3. As a reminder, the green triangle HOLDEN-XT was trained for twice the number of epochs as the other network configurations, which results in an error series that is twice as long.

For the AVA configuration, one can visibly determine a larger mean-squared error during the entirety of its training process compared to the Holden configurations. This is visualized through the fact that the purple crossed AVA data series lies clearly above the others in Figure 5.11. Additionally, the mean-squared error of the entire output vector appears visibly more irregular between epochs during the training process than that of the Holden configurations.

5.6 Pipeline Overview

This section aims to present the computation time, see Table 5.3, and the data size, see Table 5.4, for each step of the full phase-functioned neural network pipeline, presented in Section 3.3.

    Step                 HOLDEN    HOLDEN-XL    HOLDEN-XT    AVA
    Generate Patches     28min     -            -            -
    Generate Database    90min     -            -            55min
    Network Training     43h       47h          91h          31h
    Neural Network       0.38ms    0.50ms       0.38ms       1.5ms

Table 5.3: Computational time required throughout the pipeline. Entries marked ‘-’ share the HOLDEN results.

In summary, for the HOLDEN-XL configuration, Table 5.3 shows that there is little difference in training time when adding a new network layer to the Holden network.
The much longer training time of the HOLDEN-XT configuration was not unexpected, as a result of it being trained for twice the number of epochs, see Section 3.1.3.

Additionally, the table shows that the database generation, and to some extent the network training, is considerably quicker for the AVA configuration. This means that even though the AVA configuration had six times the number of skeleton joints, see Section 3.1.3, the fact that it only had a fourth of the frames compared to the Holden configurations, see Section 3.1.3, resulted in it being trained considerably faster.

    Data                    HOLDEN    HOLDEN-XL    HOLDEN-XT    AVA
    Height Fields           134MB     -            -            -
    Terrain Patches         606MB     -            -            -
    Motion Capture Files    848MB*    -            -            7.62GB*
    Phase Labels            7.04MB    -            -            -
    Other Labels            56.9MB    -            -            -
    Training Database       7.12GB    -            -            10.4GB
    Network Weights         122MB     176MB        122MB        374MB

Table 5.4: Size of different data throughout the pipeline. Entries marked ‘-’ share the HOLDEN results. *: Motion Capture Files are in 120Hz.

In summary, Table 5.4 shows the considerable increase in memory size from the Holden to the AVA network configurations. As mentioned in Section 3.1.3, do note that the AVA database only contains a number of frame data points equal to a quarter of that of the Holden configurations.

Additionally, as discussed previously, given that the HOLDEN-XT configuration differs from the default HOLDEN configuration solely through training time, it is expected that the network weights produced by the two have the same memory size. In contrast, given that the HOLDEN-XL configuration has more network nodes than the HOLDEN configuration, it is expected that the former has more weights, requiring more memory.

5.7 Skinning Visualization

Video recordings of these visualizations are available here [48].

These are the in-engine results of the network orientational outputs after switching the X- and Z-rotations, and inverting the Y- and switched X-rotations.
Figure 5.12: HOLDEN positional and orientational output skinned. Left: The skinned HOLDEN character model in default stance. Middle: HOLDEN output skinned. Right: HOLDEN output skinned with visible skeleton.

Figure 5.13: AVA positional and orientational output skinned. Left: The skinned AVA character model in default stance. Middle: AVA output skinned. Right: AVA output skinned with visible skeleton.

In Figure 5.12 and Figure 5.13, one can see examples of the skeleton joint translation and orientation outputs that the HOLDEN and AVA network configurations produced during their respective responsivity evaluations, skinned to 3D character models.

The HOLDEN skinning, at a quick glance, appears correct in general. Occasionally, certain specific joints experience single-frame orientation errors. For example, the head and torso sometimes rotate over 180 degrees around an axis in a single frame. As of writing, this error is still being investigated.

Similarly, the skinned AVA results also produce certain erroneous orientations, however much more frequently and for more than only two joints. In the middle panel of Figure 5.13, one can see the torso joint being rotated over 180 degrees. As of writing, this error is still being investigated. Additionally, the original skeleton file, onto which the motion capture data was retargeted, was lost and replaced with a new skeleton file for the runtime process. This new skeleton is perfectly equivalent to the old one, except for the facial structure. As such, the facial contortions shown in Figure 5.13 are to be expected.

However, even though the character model occasionally appears incorrect, the underlying skeleton still moves correctly. This shows that the positional outputs of the neural networks are correct, but that this is not always the case for the orientational outputs.
This is presumably a result of the fact that the network is trained in another set of world-axis orientations, see Section 3.3.4, compared to that of the Apex engine. The positional outputs of the neural networks are manually converted at runtime to match that of the engine, hence the apparently correct skeletal output. This inconsistency in world-axis orientations, however, does not affect the results of the responsivity or accuracy evaluations. This is further discussed in Section 6.4.2.

6 Discussion

6.1 Discussing the Research Questions

This section aims to, through discussion, answer the research questions as presented in Section 1.3.

6.1.1 Discussing Responsivity

The research question regarding responsivity asked how much computational time is required for procedural single-character locomotive animations, see Section 1.3. This is answered through testing an engine-integrated solution using the four different network configurations.

The responsivity results presented in Section 5.1 reveal a considerable computational difference between the Avalanche and the Holden network configurations. This significant distinction in frame time could be explained by the difference in number of skeleton joints. As mentioned in Section 3.1.3, the Holden skeleton has 31 joints whereas the Avalanche skeleton has 191. Since all skeleton joints are fed into the neural network during the runtime prediction process, the width of the networks, and in turn the number of computations each frame, depends on the number of joints.

Since the AVA network required a size of 1’302x512x1’751, whereas the two shallower Holden configurations required only a size of 342x512x311, the number of computations required each frame to propagate the character pose through the network is greatly increased in the former configuration.
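The size comparison can be made concrete with a short C++ sketch that applies the counting rule of Equation 6.1 to the layer widths quoted above and then fits a least-squares line through the mean frame times of Table 5.1. All names (`CountComputations`, `FitLine`, `LineFit`) and the `count_output_nodes` flag are illustrative assumptions, as is the 342x512x512x311 layout used for HOLDEN-XL. One observation from working through the numbers: evaluating Equation 6.1 strictly over the non-output layers of 342x512x311 gives 335’190, which is 311 below the HOLDEN entry in Table 6.1. The tabulated values are reproduced when the output layer width is added as well, which is presumably how they were computed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Equation 6.1: sum of n_l * (1 + n_{l+1}) over all non-output layers l.
// `widths` lists the layer widths from input to output. When
// `count_output_nodes` is true the output width is added once more; this is
// an assumption that happens to reproduce the values in Table 6.1.
long long CountComputations(const std::vector<long long>& widths,
                            bool count_output_nodes) {
    assert(widths.size() >= 2);  // at least an input and an output layer
    long long total = 0;
    for (std::size_t l = 0; l + 1 < widths.size(); ++l) {
        total += widths[l] * (1 + widths[l + 1]);
    }
    if (count_output_nodes) {
        total += widths.back();
    }
    return total;
}

struct LineFit {
    double slope;
    double intercept;
};

// Ordinary least-squares fit of y = slope * x + intercept.
LineFit FitLine(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size() && x.size() >= 2);
    const double n = static_cast<double>(x.size());
    double mean_x = 0.0;
    double mean_y = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        mean_x += x[i];
        mean_y += y[i];
    }
    mean_x /= n;
    mean_y /= n;
    double sxy = 0.0;
    double sxx = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double dx = x[i] - mean_x;
        sxy += dx * (y[i] - mean_y);
        sxx += dx * dx;
    }
    assert(sxx > 0.0);  // x values must not all be identical
    const double slope = sxy / sxx;
    return LineFit{slope, mean_y - slope * mean_x};
}
```

With the four computation counts of Table 6.1 as x and the four mean frame times of Table 5.1 as y, `FitLine` returns a slope of roughly 9.65 · 10^-7 ms per computation and an intercept of roughly 0.018 ms, which is close to the regression line reported under Figure 6.1.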
A naive way of counting computations in a simple feed-forward network like the one used in this project, naive since it assumes sequential computing, is to sum the number of inter-layer network connections and the number of biases, like so:

\[ \sum_{l} n_l \,(1 + n_{l+1}) \tag{6.1} \]

Where ‘l’ is the index of all non-output network layers, ‘l + 1’ is the index of the network layer following layer ‘l’, and ‘n_l’ is the number of nodes (width) in layer ‘l’.

The number of computations per network configuration using the above formula can be seen in Table 6.1.

                    HOLDEN     HOLDEN-XL    HOLDEN-XT    AVA
    computations    335’501    598’157      335’501      1’566’701

Table 6.1: The number of computations required per network configuration

These data points can then be inserted into a line chart to display the linear relationship between the number of network computations and the frame time, such as Figure 6.1.

Figure 6.1: Linear regression of computational time over network calculations. Beware: this is a gross simplification simply to show a possible relation between network size and computational load at runtime. Regression line: y ≈ 9.6 · 10^-7 · x + 0.018

6.1.2 Discussing Accuracy

The second subsidiary research question asked how accurate the generated locomotive character animations are to the original animation data, see Section 1.3. This is answered here through relative comparisons between the network configurations.

Difference between the Holdens

The three Holden network configurations maintain close resemblance with equivalent results throughout the error evaluation process. This can be seen through consistent overlap between the data shown in Figure 5.6, Figure 5.7, and Figure 5.8, and with a near zero divergence metric, as presented in Section 3.1.1. Additionally, the mean-squared-error evaluation during the training process also shows no significant difference between the Holden configurations, see Figure 5.11.

Through these statistics, one can conclude that there was insignificant gain in either doubling the number of hidden layers in the neural network, as was the case for the HOLDEN-XL configuration, or doubling the duration of the training process, as was the case for the HOLDEN-XT configuration.

Additionally, the lack of improvement in results between the default HOLDEN and the longer trained HOLDEN-XT configuration is evidence that the original implementation by Holden et al. does not suffer from underfitting, see Section 2.1.4 and Section 3.1.3.

AVA versus AVA-M

Additionally, the AVA and AVA-M data series closely follow the same curvature in Figure 5.6, however translated vertically. This curvature is distinctively different from that of the Holden configurations. Given that the AVA data series has consistently less of an error than the AVA-M series, one can deduce that the masked joints, those not present in AVA-M, give rise to a constantly lower error in comparison.

The mean absolute error definition used in this project includes a normalizing denominator, see Equation 3.1, to allow the error to be relative to the ground truth positional values. For example, an error of 0.01 equates to a predicted joint position that is 1% off the ground truth, see Section 3.1.2. However, this means that joints further away, that is, joints at positions with high positional values, would need a larger absolute error to produce the same effective evaluated error as a joint closer to the axis origin.

As a concrete example, consider a joint a with a ground truth position at a_t = 100. For this joint to contribute with an error of 0.01, the predicted position would need to be at a_p = 101. However, if we consider a different joint b with a ground truth position at b_t = 1, the predicted position would need only to be b_p = 1.01 to contribute with the same error.
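The effect of the normalizing denominator can be checked with a one-dimensional C++ sketch. The function name `RelativeError` is a hypothetical illustration, and dividing by the absolute value of the ground truth is an assumption made here so that the result stays non-negative; it is not taken from the thesis implementation.

```cpp
#include <cassert>
#include <cmath>

// One-dimensional relative error |t - p| / |t|, in the spirit of the
// normalized mean absolute error of Section 3.1.2. Taking |t| in the
// denominator is an assumption of this sketch.
double RelativeError(double truth, double predicted) {
    assert(truth != 0.0);  // the relative error is undefined at the origin
    return std::fabs(truth - predicted) / std::fabs(truth);
}
```

For the two joints above, `RelativeError(100.0, 101.0)` and `RelativeError(1.0, 1.01)` both evaluate to approximately 0.01, that is 1%, even though the absolute errors are 1 and 0.01 respectively. A joint far from the origin can therefore be off by a hundred times more in absolute terms while contributing the same evaluated error.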
Out of the masked AVA joints, a considerable number are highly concentrated within the character head and face, see Section 3.1.3. As these joints are further away from the axis origin, the discrepancy in error between AVA and AVA-M may be a result of the inherent relativity of the mean absolute error definition used in this project. As such, AVA-M might be more suitable for comparisons with the Holden configurations.

AVA-M versus the Holdens

Also, even though the mean-absolute prediction error at runtime of the AVA configuration was significantly lower than that of the Holden configurations, see Figure A.5, the mean-squared prediction error during the training process of the former was considerably higher than that of the latter, see Figure 5.11.

This difference could be a result of the fact that the runtime evaluation only considered the three-dimensional skeleton joint positions, whereas the training process evaluation considered the entire output vector. This means that it is possible that the AVA network configuration is comparatively much better at joint position prediction than at predicting the other output features, such as joint orientations and velocities, and sample point positions and trajectories, see Section 3.2.3.

Additionally, another potential reason behind this difference is the fact that the two evaluation processes used different error definitions: mean-absolute error and mean-squared error. The fact that the absolute error was lower than the squared error could be an indicator that the data, in this case the accuracy of the AVA network, had many extreme outliers. This is because the mean-squared error definition squares the per data point error, meaning that errors smaller than one get reduced and errors larger than one get amplified, compared to the mean-absolute error definition.
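The interaction between outliers and the two error definitions can be illustrated with a small C++ sketch. The function names and the sample values mentioned afterwards are hypothetical and chosen only to show the effect; they are not measurements from this project.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Mean of the absolute per data point errors.
double MeanAbsoluteError(const std::vector<double>& errors) {
    assert(!errors.empty());
    double sum = 0.0;
    for (double e : errors) {
        sum += std::fabs(e);
    }
    return sum / static_cast<double>(errors.size());
}

// Mean of the squared per data point errors. Errors below one shrink when
// squared, errors above one grow.
double MeanSquaredError(const std::vector<double>& errors) {
    assert(!errors.empty());
    double sum = 0.0;
    for (double e : errors) {
        sum += e * e;
    }
    return sum / static_cast<double>(errors.size());
}
```

For the made-up errors {0.1, 0.2, 0.1, 0.2}, the mean-absolute error is 0.15 while the mean-squared error is only 0.025. Replacing the last value with a single outlier of 5.0 raises the mean-absolute error to 1.35 but the mean-squared error to about 6.27, so the squared definition ends up well above the absolute one.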
Another potential source of the difference in error between the AVA and Holden configurations was the choice of evaluation frames per motion capture file, see Section 3.1.2. As a result of insufficient system memory for keeping each joint position in each frame in memory, a decision was made to only consider the first 500’000 three-dimensional joint positions in each data file. This, in combination with the fact that the AVA configuration only considered every fourth frame and that the AVA skeleton had more than six times the skeleton joints, see Section 3.1.3, means that if the maximum number of joint positions is reached, the two configurations would probably consider different blocks of frames within the motion capture data.

In other words, if the Holden configurations reached the memory limit of 500’000 joint positions, they would have considered the first 500’000 / 31 ≈ 16’000 frames. On the other hand, if the AVA configuration reached the memory limit of 500’000 joint positions, it would instead only have considered the first 500’000 · 4 / 191 ≈ 10’000 frames. If the different network configurations were evaluated on different sets of motion frames, then that could make for an unfair comparison.

6.1.3 Discussing Architecture

The final subsidiary research question asked how the phase-functioned neural network architecture presented in Holden et al. could be improved, see Section 1.3.

An initial hypothesis may be that allowing a neural network to train for a longer period of time, or to have a deeper network, may increase the predictive accuracy of the network. However, when comparing the accuracy results of the Holden configurations, see Section 6.1.2, one can conclude that this is not the case, at least not for this particular model. This was shown through the insignificant difference in accuracy between the three network configurations: default (HOLDEN), double the hidden layers (HOLDEN-XL), and double the training time (HOLDEN-XT), see Section 6.1.2.
Moreover, not only did neither of the two altered Holden configurations show any strictly positive effect, both showed strictly negative ones. For example, given that the HOLDEN-XL configuration required an additional layer of network weights and biases, it was shown to require both longer computational time, see Figure 5.1 and Table 5.3, and more memory, see Table 5.4, at runtime. For the HOLDEN-XT configuration, the only clear downside, compared to the default HOLDEN network, was the increase in training time, see Table 5.3. As such, the altered Holden configurations have been shown to only perform equally to, or worse than, the default HOLDEN configuration.

The fourth network configuration, AVA, which in one aspect may be used to evaluate the generalizability of the phase-functioned neural network approach, had its own fair share of issues. For the responsivity evaluation, the AVA network was shown to be significantly slower to evaluate at runtime, compared to all other network configurations. This, however, may be quite unsurprising as it required many more computations for each prediction, see Figure 6.1, as a result of it being based on a character skeleton with more than six times as many joints as the skeleton used for the other network c