Predictive Maintenance
in Production Robots
in a Real World Industrial Setting

A comparative study using different machine learning models
in a real world application

Master’s thesis in Computer science and engineering

ADAM SVENSSON
KARL WENNERSTRÖM

Department of Computer Science and Engineering
CHALMERS UNIVERSITY OF TECHNOLOGY
Gothenburg, Sweden 2023


Master’s thesis 2023

Predictive Maintenance
in Production Robots

in a Real World Industrial Setting

A comparative study using different machine learning models in a
real world application

ADAM SVENSSON
KARL WENNERSTRÖM

Department of Computer Science and Engineering
Chalmers University of Technology

University of Gothenburg
Gothenburg, Sweden 2023


Predictive Maintainance in Production Robots in a Real World Industrial Setting
A comparative study using different machine learning models in a real world appli-
cation
ADAM SVENSSON
KARL WENNERSTRÖM

© ADAM SVENSSON, KARL WENNERSTRÖM, 2023.

Supervisor: Yinan Yu, Department of Computer Science and Engineering
Advisor: Niklas Björk, Sandvik Coromant AB
Advisor: Fräs Jan Jonsson, Sandvik Coromant AB
Examiner: Mary Sheeran, Department of Computer Science and Engineering

Master’s Thesis 2023
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Typeset in LATEX
Gothenburg, Sweden 2023

iv


Predictive Maintainance in Production Robots in a Real World Industrial Setting
A comparative study using different machine learning models in a real world appli-
cation
ADAM SVENSSON
KARL WENNERSTRÖM
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg

Abstract
With the exponential growth of data and advancements in AI technology, Predic-
tive Maintenance (PdM) has emerged as a vital practice for optimizing equipment
maintenance and minimizing unplanned downtime. This study was performed in
collaboration with Sandvik Coromant, a company producing steel products and
actively collecting data for evaluation purposes, to investigate how the collected
data can be utilized for decision-making processes. Specifically, the study analy-
ses vibrational data to address the issues related to unexpected tool malfunctions.
Based on the available data, anomaly detection was identified as the most suitable
approach to leverage the stored data based on its characteristics. A comparative
study where various anomaly detection models were evaluated demonstrated that
a reconstruction-based LSTM autoencoder yields the highest performance. The re-
construction approach exhibited its effectiveness in detecting and flagging potential
abnormalities, capturing 71% of the malfunctions with an F1 score of 0.75 for the
data used for the comparison. Extending the model to other tools displayed the
challenges posed within time-series analysis, proving unique characteristics for each
case. The findings from this study provide valuable insights into the implementa-
tion of anomaly detection techniques for leveraging collected data and enhancing
decision-making processes in Sandvik Coromant and similar industrial settings.

Keywords: Machine learning, Time series reconstruction, Time series forecasting,
Anomaly detection.

v


Acknowledgements
We would like to express our deepest gratitude to our supervisor, Yinan Yu, for
always guiding us when fumbling in the dark with your insightful feedback and late-
night replies. Your ability to find a new perspective on each situation contributed
greatly to this work.

In addition, we would like to extend our appreciation to the entire team at Sandvik
Coromant, with special recognition to Niklas Björk for assisting us in organizing
and outlining the project, and Fräs Jan Jonsson for patiently addressing all of our
numerous questions. To all of you: you made us feel welcome from day one and your
invaluable support and guidance have been instrumental in our progress. You helped
us with your high spirits and provided us with the resources needed to complete this
project. We are grateful for your trust in our abilities and the opportunity to work
on such an exciting project.

Also, a big thank you to Syntronic for providing us with workstations at your office.

Last but not least, we are deeply thankful to family and friends for all the love and
support you have given us during this project, despite our absence. We could not
have done this without you.

Adam Svensson, Karl Wennerström, Gothenburg, 2023-06-16

vii


Contents

List of Figures xi

List of Tables xiii

1 Introduction 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.1.1 Achieving Industry 4.0 . . . . . . . . . . . . . . . . . . . . . . 2
1.1.2 Background at Sandvik . . . . . . . . . . . . . . . . . . . . . . 3

1.1.2.1 Data Infrastructure . . . . . . . . . . . . . . . . . . . 5
1.1.2.2 Problem . . . . . . . . . . . . . . . . . . . . . . . . . 5

1.2 Our Project . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.1 Project aim . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

1.3 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2 Theory 7
2.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Time Series Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.4 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.4.1 Recurrent Neural Networks . . . . . . . . . . . . . . . . . . . 12
2.4.2 Graph Attention Network . . . . . . . . . . . . . . . . . . . . 12
2.4.3 Autoencoder Architecture . . . . . . . . . . . . . . . . . . . . 13

2.5 ARIMA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.6 Anomaly Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

2.6.1 Forecasting Based Anomaly Detection . . . . . . . . . . . . . 16
2.6.2 Reconstruction Based Anomaly Detection . . . . . . . . . . . 17
2.6.3 Anomaly Score Thresholds . . . . . . . . . . . . . . . . . . . . 17

2.6.3.1 Epsilon Threshold Method . . . . . . . . . . . . . . . 17
2.6.3.2 Rolling MAD Threshold Method . . . . . . . . . . . 18

2.6.4 Anomaly Detection Evaluation Metrics . . . . . . . . . . . . . 18

3 Methods 21
3.1 Data Understanding . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

3.1.1 Data Description . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1.2 Data Exploration . . . . . . . . . . . . . . . . . . . . . . . . . 22

3.2 Technique Exploration . . . . . . . . . . . . . . . . . . . . . . . . . . 24

ix


Contents

3.3 Data Preprocessing and Feature Engineering . . . . . . . . . . . . . . 26
3.3.1 Data Cleaning . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3.2 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3.3 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3.4 Data Transformation . . . . . . . . . . . . . . . . . . . . . . . 27

3.4 Model Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.4.1 General Description . . . . . . . . . . . . . . . . . . . . . . . . 28
3.4.2 Baseline Model . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.4.3 MTAD-GAT . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.4.4 LSTM Autoencoder . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4.5 ARIMA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

3.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.6 Potential Pitfalls in the Data . . . . . . . . . . . . . . . . . . . . . . 32

4 Results 35
4.1 Comparison of Approaches . . . . . . . . . . . . . . . . . . . . . . . . 35

4.1.1 Predicting RUL . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.1.2 Anomaly Detection . . . . . . . . . . . . . . . . . . . . . . . . 37

4.2 Model Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.2.1 Hyperparameter Tuning . . . . . . . . . . . . . . . . . . . . . 37
4.2.2 ROC and PRC . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.2.3 Use-case Scenario . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.2.4 False Positive Analysis . . . . . . . . . . . . . . . . . . . . . . 42

4.3 Generalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.4 Model versus the Current Workflow . . . . . . . . . . . . . . . . . . . 45

5 Discussion 47
5.1 Approaching Data-Driven Decision Making . . . . . . . . . . . . . . . 47
5.2 Anomaly Detection Model . . . . . . . . . . . . . . . . . . . . . . . . 48
5.3 Improvements from a Model Perspective . . . . . . . . . . . . . . . . 49
5.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

5.4.1 Future Works . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

6 Takeaways 53

Bibliography 55

A Appendix - Features I

B Appendix - Model Parameters III

C Appendix - Model Implementation V

x


List of Figures

1.1 The digital platform framework present at the company. Connectors
are used to collect raw machine signals from the machines, allowing
to connect to various electronic components. Raw signals are stored
in the PI Data Archive and is preprocessed in PI Asset Framework
to possess human readable information, which can then be used in
several user tools. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2.1 Forecasting-based and reconstruction-based models. Image from [13]. 10
2.2 Multivariate time series can be transformed into fully connected graphs

in the temporal and feature dimensions, allowing for modeling in each
dimension separately. Images from [14]. . . . . . . . . . . . . . . . . . 14

2.3 Basic autoencoder architecture overview. From input, the data is
compressed to a latent representation through a decoder. And the
decoder reconstructs the input data from the latent representation. . 14

3.1 Number of malfunctions per tool and machine. . . . . . . . . . . . . . 23
3.2 Time series data around a malfunction, with an abnormal value prior. 24
3.3 Time series data around a malfunction, without outlier. . . . . . . . . 25
3.4 Boxplots for the vibration signals in the full dataset and in close

relation to malfunctions. . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.5 Example of how the two cleaning steps can be applied to the data,

removing extreme values and bridges. . . . . . . . . . . . . . . . . . . 27
3.6 A sample sequence showcasing the impact Material ID has on the

vibration signals. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.7 MTAD-GAT architecture, image from [14]. . . . . . . . . . . . . . . . 30
3.8 A symmetrical LSTM autoencoder architecture with 3 LSTM layers

with sizes 64, 32 and 16. . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.9 A sample from a vibration signal acting suspicious a couple of se-

quences before the malfunction was found. The dashed lines indicates
a new sequence. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

4.1 The life counter for a contiguous period when the machine is in pro-
duction, for a specific tool. The left figure portrays the extremely
large values and the right figure demonstrates temporary resets of
the counter. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

xi


List of Figures

4.2 Validation loss for MTAD-GAT trained on tool 4503 and machine 2
for different learning rates and weight decays. . . . . . . . . . . . . . 37

4.3 Forecast and reconstruction loss, respectively, for MTAD-GAT trained
on tool 4503 and machine 2 for different window sizes. . . . . . . . . . 38

4.4 ROC for all models. The subscripts in the legends indicate the window
size for reconstruction. . . . . . . . . . . . . . . . . . . . . . . . . . . 39

4.5 PRC for all models. The subscript in the models’ name indicates the
window size for reconstruction. . . . . . . . . . . . . . . . . . . . . . . 41

4.6 True positive classification. . . . . . . . . . . . . . . . . . . . . . . . . 42
4.7 False positive classification. . . . . . . . . . . . . . . . . . . . . . . . 43
4.8 Reconstruction errors and threshold. FP classification near true anomaly

sequences. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

C.1 High level flowchart of model implementation. . . . . . . . . . . . . . VI

xii


List of Tables

1.1 Necessary requirements to successfully achieve an industry 4.0 scenario. 3

2.1 Classification outcomes after applying anomaly detection models. . . 18

3.1 Important signals: name and description. . . . . . . . . . . . . . . . . 22
3.2 The number of malfunctions for each tool and machine, together with

the total number of data points. . . . . . . . . . . . . . . . . . . . . . 23
3.3 The enumerated sequences preceded by one or multiple abnormal values. 24

4.1 Frequency for different values of maximum life time for the chosen,
contiguous sequence of 1000 points. . . . . . . . . . . . . . . . . . . . 36

4.2 Model evaluation in a use-case scenario, for different thresholds. . . . 41
4.3 Re-evaluating performance, disregarding FP sequences and FPs in

close relation to true anomaly sequences. . . . . . . . . . . . . . . . . 44
4.4 False positives (true positives) for each feature for autoencoder10. . . 44
4.5 Performance evaluation for the subset of tools selected for machines

1 and 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

A.1 Extracted signals: name and description. . . . . . . . . . . . . . . . . I

B.1 MTAD-GAT Hyperparameters . . . . . . . . . . . . . . . . . . . . . . III

xiii


List of Tables

xiv


1
Introduction

The mitigation of equipment failure in production environments is crucial for all man-
ufacturing companies. Such failures can result in maintenance costs that account
for a significant portion of overall manufacturing expenses, ranging from 15% to
60% [1]. These expenses may include the production of faulty products, equipment
repairs, and production downtime [2]. To address this issue, advanced technologies
such as artificial intelligence (AI) have been proposed as a solution, enabling smart
manufacturing and paving the way for Industry 4.0, the fourth industrial revolution
[3]. In an Industry 4.0 setting, components, machines, and production systems are
integrated and connected, allowing for the real-time transfer of information between
production site components and employees [4]. The access to time series data, col-
lected at regular intervals from machine and sensor signals, forms the foundation
for utilizing AI advancements in predictive maintenance (PdM) models, to make
informed and predictive decisions that improve production efficiency.

PdM models, which can be statistical, machine learning (ML), or deep learning
(DL) based, rely on historical data to forecast equipment breakdowns and provide
insight into the decision-making process [5]. The choice of PdM model depends on
the characteristics of the data and the specific needs of the facility. Two common
approaches are (1) predicting the remaining useful life (RUL) of the equipment or
(2) identifying unexpected behavior, also known as anomaly detection. Previous
research has explored statistical models such as ARIMA, ML models such as SVM
and RF, and DL models including AutoEncoder, MTAD-GAT, LSTM, ANN, and
RNN. However, there is no general recipe for achieving an accurate PdM model
as each project has unique circumstances [4]. Anomaly detection studies are often
conducted on synthesized data, leaving the need for further investigation in real-
world applications.

In this project, a real world scenario is studied where time series data from produc-
tion machines at Sandvik Coromant AB is analyzed, in order to find an appropriate
approach to reach an industry 4.0 scenario. The vast amount of data collected on a
daily basis helps monitoring and evaluating the production, but the aim is to utilize
the data for decision-making processes in order to reduce production downtime and
thus increase the overall production efficiency. Due to the challenges posed by time
series forecasting, such as imbalanced data, insufficient labeled data and erroneous
data, the first objective is to decide whether the approach to predict RUL or de-
tecting anomalies is the appropriate approach. Then, a comparative study between

1


1. Introduction

various models within the chosen technique is carried out, focusing on the trade-off
between detected anomalies and false alarms, together with other requirements pre-
sented by the company. Finally, an analysis of the results is performed, investigating
the effects that a PdM model can accomplish.

1.1 Background
In this section, a background to the requirements for industry 4.0 is presented,
together with the current workflow at Sandvik Coromant AB.

1.1.1 Achieving Industry 4.0
Industry 4.0 is seen as a way to increase manufacturing efficiency, reduce costs, and
improve product quality [5]. The ultimate goal of Industry 4.0 is to create a fully
connected and automated production process, where machines can communicate
with each other and production can be optimized in real-time. The use of Industry
4.0 technologies is becoming increasingly important for manufacturing companies to
stay competitive in the global market [1]. It enables companies to adapt to changes
in demand and produce goods in a flexible and efficient way. However, the adoption
of Industry 4.0 technologies requires significant investment, as it often involves the
redesign of production lines, the installation of new equipment and the training of
personnel to operate and maintain the new technology.

Industry 4.0 has brought significant changes to the manufacturing industry, includ-
ing advancements in maintenance strategies. There are four types of maintenance
strategies: (1) corrective, (2) preventive, (3) predictive, and (4) prescriptive [5]. In
corrective maintenance, also known as run-to-failure (R2F), the production process
continues until equipment failure occurs. While such a strategy exploits equipment
lifetime, each failure cause major impact in the production environment. Preventive
maintenance, on the other hand, is scheduled maintenance. The time interval be-
tween maintenance occasions should be large enough to utilize the expected lifetime
of the equipment while also minimizing the risk of an equipment failure. Predictive
maintenance is a strategy that utilizes equipment information to enable more flexi-
ble actions, allowing for both optimal utilization of equipment and the discovery of
unexpected behavior that may indicate premature malfunction. Data is streamed
from sensors and analyzed to make predictions on equipment status, using various
machine learning techniques, and appropriate actions are taken accordingly. The
highest level of maintenance is prescriptive maintenance, which combines a pre-
dictive system with suggested measures and, in some cases, even automates these
actions. However, the successful implementation of these techniques requires certain
prerequisites [3][1].

To transition from real-time data monitoring and manual equipment inspection to
a data-driven decision-making approach, certain conditions must be met [1]. These
conditions relate to data collection during production processes, evaluation capabil-
ities, and the IT system for monitoring equipment conditions. The specific, relevant
requirements are listed in Tab. 1.1.

2


1. Introduction

Table 1.1: Necessary requirements to successfully achieve an industry 4.0 scenario.

Area Requirements

Data collection
Continuous inspection
Sensors and other data

Digital recording

Performance measurement Digital trend analyses

IT Condition monitoring software
Big data platform

The foundation is the collection of data. All processes regarding monitoring or
analyzing the data in any fashion require existing and accurate data. In [3], the
authors propose a list of question to evaluate how well the situation at hand is
equipped to approach a PdM scenario. The appropriate questions, related to the
data, are posted below (not ordered by relevance).

1. Can the problem be formulated as a binary classification problem?

2. Are predictor features available?

3. Are all sources of variation captured by the predictor features?

4. Are the predictor features strongly related to the underlying physics of the
project?

5. Is the training data (i.e., sample size) big enough?

6. Are there labels (i.e., good, defective) available?

7. If there are no labels, how long would it take to generate labels?

Analyzing these questions can provide valuable insights into the feasibility and po-
tential success of a PdM project. It is worth noting that these questions can be
adapted to suit the specific requirements of each individual project and its unique
setting. For instance, depending on the project’s characteristics, it may be more
appropriate to formulate the problem as a multi-classification or regression task.

In the following section, we will discuss the requirements for a successful PdM im-
plementation from the perspective of the company.

1.1.2 Background at Sandvik
Sandvik Coromant AB is a Swedish multinational engineering company that special-
izes in manufacturing and selling cutting tools and solutions for the metalworking
industry. The company was founded in 1942 in Sandviken, Sweden and todays
manufacturing is mainly done in their industrial facility in Gimo, Sweden. Sand-
vik Coromant is a subsidiary of the Sandvik Group, a global industrial engineering

3


1. Introduction

company. Their product portfolio includes a wide range of cutting tools, including
milling cutters, turning tools, drills, and boring tools, as well as tooling systems and
solutions for various applications. The products are used in a variety of industries,
including aerospace, automotive, medical, oil and gas, and general engineering.

Sandvik Coromant has a global presence, with sales offices and manufacturing fa-
cilities in over 130 countries. The company employs over 7,000 people worldwide
and generated sales of approximately SEK 20 billion in 2020. Sandvik Coromant is
recognized as a leading provider of cutting tools and solutions.

At Gimo, there are more than 80 production machines, where some of them are
tasked with producing the same products. A production robot has on average a
life length of up to 25 years, according to experts at Sandvik, thus causing the
production facility to have various age of their machines. The machines involved in
this project operate using a single tool at a time, which is referred to as the master
tool. However, they are designed to hold multiple tools in order to minimize the time
required to switch between tools. Additionally, each tool has one or several backups,
referred to as the sibling tool, which can be used if the master tool is unavailable
or has malfunctioned. Generally, the tool with the shorter estimated remaining life
is used. For each tool, the theoretical remaining life is displayed, counting down
from a fixed level, the maximum life length, specified by the type of the tool. After
each use, laser measurements are utilized to inspect the condition of the tool. If it
is broken, production stops and the quality of the previously produced products are
manually inspected.

The production site in Gimo is Sandvik’s most automated and digitized factory,
with a plant performance team dedicated to analyze the information derived from
both raw machine signals and manually installed sensors, capturing machine specific
signals and vibrational data from the machine. In general, hundreds of signals are
collected from each machine and their integrated sensors. This stream of data is the
key to efficient plant performance, enabling monitoring and valuable insights regard-
ing operational statistics, managed via a digital platform. Thus, the requirements
posed in Tab. 1.1 are fulfilled. Further insight in how eligible the current workflow
is for industry 4.0, is obtained by answering the questions declared above.

The company has been collecting data for many years for some features, while other
features correspond to more recently installed sensors. The sampling frequency is
configurable, but generally the values are logged per second, averaging the values
collected during that time frame. Overall, it is safe to say that the collected data
compose enough data. The features collected from the machines and their sensors
constitute the predictor features. Amongst these are information about physical
aspects related to the drilling and turning processes, for example vibrations and ro-
tational speed. Indications of when equipment is malfunctioning are also reported.
The company also keep track of estimated remaining life and the time each equip-
ment has been operating. Combining this with equipment malfunction enables both
labels for a regression model but also the construction of anomaly sequences. Ac-
cording to the answers to the questions, the conditions appear to be favourable for
approaching a PdM strategy.

4


1. Introduction

Figure 1.1: The digital platform framework present at the company. Connectors
are used to collect raw machine signals from the machines, allowing to connect to
various electronic components. Raw signals are stored in the PI Data Archive and is
preprocessed in PI Asset Framework to possess human readable information, which
can then be used in several user tools.

1.1.2.1 Data Infrastructure

An overview of the digital platform is presented in Fig. 1.1. The raw signals are
stored in a PI Data Archive, which is a traditional data historian database storing
signals together with a timestamp. In order to reduce the amount of storage, most
signals are only stored whenever there is a significant change between measurements.
The threshold for what is a significant change differs between signals.

In order to translate raw data to information, they apply logic operations to extract
useful information about the signals. This information is further processed and used
in user-friendly tools for both plant workers by monitoring machine health and office
workers for analyses of plant performance, such as overall equipment efficiency.

1.1.2.2 Problem

The current workflow at Sandvik Coromant AB for the selected machines incor-
porates the preventive maintenance strategy, which infers that the equipment is
replaced according to scheduled occasions. In the case of a tool breakage, the re-
cently manufactured products need to be manually inspected to distinguish between
defected and undamaged goods to ensure product quality. This results in produc-
tion downtime for equipment maintenance and quality assessment, but also a loss
of revenue due to increased usage of materials and resources. If these issues can
be mitigated, substantial cost savings can be achieved by the company. The most
promising information is described to be encapsulated in the vibrational data, where
experts already can visually determine that a tool is new by larger vibrations than
usual.

5


1. Introduction

1.2 Our Project
In this section we present the scope of this project and our research questions.

1.2.1 Project aim
The overall goal with this project is to investigate how the collected time-series
data from the production of Sandvik Coromant in Gimo can be utilized in order
to mitigate the effects caused by tool malfunction. In addition, it will also be
investigated which improvements such a data driven solution would entail. This
will be achieved by studying collected data from specific machines, equipped with
sensors to report vibrational signals. A subset of tools is selected to perform the
experiments on. Starting from tool-level, we identify a model, statistical or ML-
based, to identify tool malfunctions based on vibrational data, with the aim that
such a model can generalize to unseen tools.

In order to concretize the goals of this project, we formulate one broader research
question:

• How can production be improved with the use of collected time-series data?

The high-level question can be broken down into three sub-questions appropriate for
the outline of the project:

1. What is an appropriate approach to utilize the collected data based on its char-
acteristics?

2. What is an appropriate model for detecting malfunctions in the equipment of
industrial robots at Sandvik Coromant AB?

3. Which improvements can be achieved by (2) without loss of quality and mini-
mizing the production downtime?

Given the sub-questions, the primary focus of this project is to find a model ac-
cording to (2) by comparing models of various character and complexity. Their
performance will be evaluated in terms of detected malfunctions combined with the
amount of raised false alarms.

1.3 Thesis Outline
In this section we give a brief overview of structure the project. The second chapter
contains the theory that has been used in the project. It introduces time series,
machine learning building blocks and theory related to anomaly detection. The
methodology is further explained in chapter three. In chapter 4, the results of the
experiments are showcased followed by discussion in chapter 5. The conclusions as
well as the future work and improvements are presented in chapter 6. Lastly, the
key takeaways are provided in chapter 7.

6


2
Theory

The theory chapter of this thesis presents the theoretical framework underpinning
the research. It provides an overview of key concepts and ideas that shape the study
and is used when exploring the research questions. First, an overview of the related
work is presented. Then, a description of time series data and its characteristics is
given. Finally, machine learning and related concepts are introduced to form the
foundation needed to understand the approaches investigated in this project: RUL
prediction and anomaly detection.

2.1 Related Work
Predictive Maintenance Approaches: PdM in production environment is a
critical research area as it can lead to significant improvements in production down-
time, product quality, and cost savings. There is no one-fits-all solution. Some stud-
ies have approached PdM in a supervised fashion, by predicting RUL [6][2][7]. An-
other common approach is to use anomaly detection, a technique applicable within
many areas, either by forecasting data [8][9], reconstructing data [10][11][12], or a
combination of the two [13][14].

RUL Prediction Models: Predicting RUL has been performed with various mod-
els and within different production scenarios. Lee et al [6] use support vector ma-
chines (SVM) and artificial neural networks (ANNs) to predict the RUL of a spindle
motor and cutting tool in a milling process. Signals of vibrations, current, and
acoustic emissions are collected during run-to-failure experiments, at a frequency
of 250Hz. From the same experiments, labels are manufactured. Another study
performs a full implementation in a manufacturing facility producing personal care
goods [2]. By using signals collected at a 3-6s interval, containing, for example,
temperature, electrical current and air pressure, a comparison of various machine
learning models demonstrates that a random forest (RF) model is the most success-
ful one. Ren et al. [7], analyzes vibrational measurements from bearing parts of
an engine, collected from sensors in an experimental setting. They acknowledge the
challenges within feature compression and perform time-frequency transformation
to simultaneously analyze the vibrations in both the time and frequency domains.
The authors combine deep autoencoders with deep neural networks to increase pre-
diction performance. The mentioned studies highlight that various approaches to
predict RUL are possible.

7


2. Theory

Anomaly Detection Models - forecasting vs reconstruction: Anomaly de-
tection is another approach to PdM. It involves identifying abnormal or anomalous
behavior in machines or systems, which could be an early indication of potential
failure or malfunction. An end-to-end framework for anomaly detection, Sintel [15],
includes three stages: preprocessing, modeling and postprocessing stage. The pre-
processing stage transforms raw signals into usable time series. Once the signals are
processed and ready to use in various anomaly detection models, the transformed
input signals are reconstructed, forecasted, or both. In the postprocessing stage, the
predicted signals are compared to the actual signals, resulting in error measurements
between the expected and actual signals. This project follows the proposed pipeline.

In the forecasting anomaly detection scenario, Ho et al. [16] compared ARIMA,
feed forward neural networks and deep learning models, such as RNN, to predict
failures in compressors. The findings showed that ARIMA performed well for short-
term forecasting, but for longer horizons, deep learning models performed better.
Similarly, Makridakis et al. [17], draw the same conclusion, after analyzing time
series from various fields, and highlight the trade-off between accuracy and model
complexity. Thus, even if the deep learning models perform better, the required
computational power might make them infeasible. Hundman et al. [9] utilizes deep
learning, specifically LSTM networks, to detect anomalies within spacecraft data
obtained from NASA. The authors propose a dynamic error threshold, to mitigate
false positives (FPs) by a pruning process where previously classified anomalies can
be reclassified as normal. However, this method requires that all predictions are
available before reclassifying is made possible. Thus, for this project, the dynamic
error threshold is not feasible. Due to its proven effectiveness, ARIMA will be used
as one of the models for short-term forecasting in our project.

Reconstruction-based methods aim to learn a compressed, low-dimensional represen-
tation of the input data that can reconstruct the original input with high accuracy.
Such techniques assume that the learned latent space captures the dominant patterns
in the dataset. However, rare events such as anomalies are often not well-represented
in the latent space and are less likely to be accurately reconstructed. Deep learning
models have shown promising results for detecting anomalies in multivariate time
series. Malhotra et al. [18], proposes an LSTM autoencoder model for detecting
anomalies in datasets with different levels of predictability. More specifically, the
model performs well on a dataset with more unpredictable behavior, containing sen-
sor readings of, for example, coolant temperature, torque, and acceleration. Also,
generative adversial networks (GANs), have been suggested as models for different
scenarios. Li et al. [12], propose multivariate anomaly detection with GAN (MAD-
GAN) to detect cyber-attacks in water treatment systems. Time series anomaly
detection with GAN (TadGAN), proposed by [19], outperforms MAD-GAN on mul-
tiple datasets, presenting different reconstruction error measurements, such as point-
wise difference and area difference.

Hybrid Anomaly Detection Models: For the combination of reconstruction
and forecasting anomaly detection models, Wong et al. [13] propose autoencoder
with regression (AER) to detect anomalies in univariate time series. AER extends

8


2. Theory

the normal usecase of an autoencoder by yielding three components: a one-step-
backward prediction, a reconstructed sequence, and a one-step-ahead prediction. A
multivariate approach is proposed by Zhao et al. [14], called multivariate time se-
ries anomaly detection with graph attention network (MTAD-GAT). MTAD-GAT
treats each univariate time series as a feature, utilizing two parallel graph attention
networks to capture both feature and temporal dependencies. It combines recon-
structing the time series, performed by a variational autoencoder (VAE), with a
one-step-ahead prediction. The threshold for detecting anomalies is determined by
the pruning technique mentioned above.

Overall Model Performance: Context, data availability, and computational
constraints play a significant role in model selection. For instance, RF performs well
for RUL prediction. When there is not enough annotated data, supervised RUL
prediction is not feasible, where anomaly detection models are typically considered.
ARIMA is effective for short-term forecasting, and deep learning models perform bet-
ter for longer-term predictions. For unpredictable time-series, reconstruction-based
models have proven to work well and especially LSTM autoencoders. Thus, ARIMA
and an autoencoder with LSTM layers are investigated for this specific setting. In
addition, MTAD-GAT, a hybrid model, is implemented with modifications to suit
real-time scenarios in this project.

2.2 Time Series Analysis
A time series is a collection of observations Xt, t ∈ T , where T is a finite index
set {1, 2, ..., n}. Every observation is made sequentially over time, meaning that the
data points are collected or recorded in a specific order according to their correspond-
ing timestamps. If a single variable changes over time, the time series is referred
to as a univariate time series. On the other hand, if there are multiple variables
changing over time, the collection is defined as a multivariate time series. Historical
time series data contains valuable information that can be harnessed for various pur-
poses. It often exhibits temporal dependencies, characterized by repeated patterns
or seasonality, as well as underlying trends. Identifying these patterns is critical
in forecasting scenarios. It can also have distributional patterns which are studied
when a reconstruction scenario is relevant.

Time series forecasting implies predicting future values of the time series, based on
historical values. It can be achieved in various ways and is commonly evaluated
by comparing the error between the prediction and the actual value, meaning that
evaluation can only be performed when the predicted time step has occurred. When
the time series is univariate, only one variable is predicted. On the other hand, when
the time series is multivariate, then multiple variables are predicted [13].

Another approach to analyze a time series is through the process of time series
reconstruction. Reconstruction-based models operate by generating a reconstruction
of a given time series sequence. These models utilize an encoding step to embed the
data in a low-level representation from which a reconstruction is made in a decoding

9


2. Theory

Figure 2.1: Forecasting-based and reconstruction-based models. Image from [13].

step. The idea is to learn a representation of the normal data, so that when inputting
a sequence with abnormal values the reconstruction will deviate a lot compared to
the true values [20]. In a multivariate scenario, these steps are performed individually
for each univariate time series. Evaluation of a reconstruction-based model is done
by comparing the reconstructed values with the true values. The comparison can
be performed between either whole sequences or specific points, depending on the
use-case. An illustration of forecasting and reconstruction made on a univariate
time series, respectively, can be seen in Fig. 2.1.

2.3 Machine Learning
Machine learning is a sub-field of AI that involves developing algorithms and models
that can learn from data and make predictions based on that data. Industries are
exposed to an increasing amount of data, which makes machine learning models
very powerful since they can automate processes and improve their accuracy as the
data collection increases. Within machine learning there are different approaches
depending on the available data. In this project, two main categories are relevant,
namely supervised learning and unsupervised learning.

Supervised learning is a type of machine learning where the model is trained on
labeled data. Generally, such a machine learning model h(·) can be described as

hΘ : RN×k → RN×k̂, (2.1)

where Θ is a set of weights and the aim is to map the input data X to predictions
Ŷ. N is the number of input samples and each sample has k features. For each
input an output is yielded with k̂ features. The objective is to find a set of weights
Θ∗ such that

Θ∗ = arg min
Θ

JΘ(Ŷ − Y), (2.2)

where J is a loss function, measuring the discrepancy between the predicted and the
actual value, and Y is the target. By splitting the data into train and test sets the
idea is to learn Θ∗ which can be used to predict the output for the unseen data in the
test set. Supervised learning can be further divided into, for example, classification
and regression tasks. In classification tasks, the objective is to predict a categorical

10


2. Theory

target variable. In regression tasks, the objective is to predict a continuous target
variable, for example the RUL.

On the other hand, a scenario where target labels are not available lies within the
category of unsupervised learning, where the goal is to identify hidden patterns,
groupings, and structures in the data. Anomaly detection is typically an unsuper-
vised scenario, where the labels regarding if a point is an anomaly or not, are not
present during training. Instead, the model aims at mimicking the data by either
forecasting or reconstructing the data. In forecasting, Y from Eq. 2.2 becomes Xt+1
given that data until time step t is provided as input. In reconstruction, Y is instead
the input data X.

For both supervised and unsupervised learning, the input data is often transformed
by some scaling or embedding operation in order to equate the importance of each
feature. Scaling can be performed to obtain the features in the same range or
distribution. Embedding operations like one-hot encoding are done for categorical
variables where each unique category becomes a binary feature if the category is
present (1) or not (0).

2.4 Neural Networks
Neural networks are a type of machine learning algorithm, that are made up of layers
of fully or partly interconnected nodes, or neurons, that process information in a
way that allows them to learn and make predictions based on input data. Neural
networks are able to handle complex, nonlinear relationships among the variables in
the time series data. As the number of layers and neurons increases |Θ| increases,
resulting in a more complex model, enabling it to extract deeper relationships within
the data. However, this complexity also makes the model more computationally
expensive [21].

Multivariate time series often have multiple variables that interact with each other in
intricate ways and neural networks can be capable of capturing these interdependen-
cies and patterns. Relevant features can be automatically extracted from the data
by them, which can help to reduce the dimensionality of the data and improve the
accuracy of predictions. In addition, neural networks can be trained to model both
short-term and long-term dependencies in the data, which is crucial for accurately
predicting future values in a time series.

The basic building block of a neural network is the neuron, which receives input
signals from other neurons and processes them to produce an output signal. The
input signals are weighted according to their importance and the neuron applies an
activation function to the weighted sum of the inputs to produce its output. The
activation function can be linear or nonlinear and it is often chosen based on the
specific task at hand [21].

Neurons are organized into layers, with each layer processing information and passing
it on to the next layer. The first layer of a neural network is the input layer, which
receives the initial data. The output layer produces the final output of the network

11


2. Theory

and one or more hidden layers in between provide additional processing and feature
extraction.

There are many different types of networks. For this project, some relevant ones are
recurrent networks, graph attention networks, and autoencoders.

2.4.1 Recurrent Neural Networks
Recurrent neural networks (RNNs) are designed to work with sequential data such
as time series data or text data, where the previous inputs and outputs of the net-
work are used to inform the current output [21]. In multivariate time series analysis,
RNNs can be trained to model complex and nonlinear relationships between vari-
ables at different time steps. This enables them to capture temporal dependencies
and patterns that may not be apparent using other methods.

A long short-term memory (LSTM) model is a specific type of RNN that uses a
complex gating mechanism to control the flow of information through the network,
allowing it to selectively forget or remember information from previous time steps
[22]. Gated recurrent units (GRUs) are a simplified version of LSTM networks, that
uses fewer gating units that adaptively control the information flow inside the unit,
leading to a powerful model with fewer parameters to train [23].

The specific approach depends on the availability of the data. For example, RNNs
can be used in supervised learning approaches when predicting RUL by leveraging
labeled historical data to learn temporal patterns and dependencies that contribute
to the degradation of the equipment. An additional application of RNN is the
detection of anomalies in unlabeled data. This is accomplished by training the RNN
on a dataset consisting mainly of normal or expected data, enabling the network to
learn these patterns. Subsequently, the deviations between the true values and the
predicted values can serve as an indicator of potential anomalies.

2.4.2 Graph Attention Network
Graph Attention Networks (GATs) are a specialized type of neural network archi-
tecture that is specifically designed for processing graph-based data, as opposed to
the grid-like data structure of images that traditional neural networks are designed
for. GATs excel at handling non-Euclidean data, such as graphs, and employ atten-
tion mechanisms to identify and focus on the most significant parts of the graph for
solving the given problem. This enables the model to pay close attention to specific
nodes and edges in the graph, which can enhance the overall accuracy and efficiency
of the network [24]. In multivariate time series, GATs can be used to model both
the temporal dependencies and the dependencies across the feature space, as seen
in Fig. 2.2.

In more detail, the input to a single graph attention layer is a set of node features
x = {x1, x2, ..., xn}, where each xi∈ Rk, where n and k are the numbers of nodes
and features respectively. The self-attention scores, determining the contribution of
node j’s features to node i, are computed according to

12


2. Theory

eij = aT LeakyReLU(W[xi ⊕ xj]) (2.3)

as proposed in Brody et al [25]. Here W and aT is a learnable weight matrix and
vector, respectively. The operator ⊕ denotes concatenation. For a small scalar
k > 0, LeakyReLU is a modified ReLU-function defined in Eq. 2.4 [26].

LeakyReLU(x) =

x, if x ≥ 0
−kx, if x < 0

(2.4)

A softmax function is applied to the attention scores eij and the normalized attention
scores between node i and node j is represented by the coefficients αij and can be
expressed as

αij = softmax(eij) = exp(eij)∑
k∈Ni

exp(eik)
, (2.5)

where Ni is the set of nodes, with edges connected to node i. The output from the
single graph attention layer, x′

i is computed in Eq. 2.6 using a non-linear activation
function σ.

x′
i = σ

( ∑
j∈Ni

αijWxj

)
(2.6)

When utilizing GAT in a multivariate time series scenario, two graphs are created:
one where each node represents a single feature across all timestamps shown in
Fig. 2.2a and another where each node represents a single timestamp across all
features shown in Fig. 2.2b. Thus, a GAT network can leverage both the temporal
dependencies across time steps and the relations between features making it useful
in detecting patterns in multivariate time series datasets.

2.4.3 Autoencoder Architecture
Autoencoders are a type of neural network commonly used for unsupervised learn-
ing tasks [21], with the main task of reconstructing data. They consist of two parts:
an encoder and a decoder. The encoder compresses the input data using a series
of layers such as convolutional, fully connected, or RNN layers [27]. These layers
progressively reduce the dimensionality of the input, capturing the most important
and relevant features. The output of the encoder, which is the compressed repre-
sentation, is often referred to as the latent space or bottleneck layer. The decoder
consists of layers that gradually expand the dimensionality, mirroring the encoding
layers but in reverse. The reconstruction process is driven by the learning objective
of minimizing the loss between the reconstruction and the observation.

A representation of its architecture can be seen in Fig. 2.3. In a scenario where
sequential data is analyzed, autoencoders are often combined with RNNs to keep
temporal dependencies. In these cases LSTM or GRU layers are used for the en-
coder and decoder part, passing also the sequential dependencies between the layers.
Hence, autoencoders are suitable for time series reconstruction.

13


2. Theory

(a) Feature oriented GAT layer treats the input data as a fully connected
graph, where each node corresponds to the values of a single feature over
all timestamps within the sliding window.

(b) Time oriented GAT layer treats the input data as a fully connected
graph, where each node corresponds to the values of a single timestamp
over all features within the sliding window.

Figure 2.2: Multivariate time series can be transformed into fully connected graphs
in the temporal and feature dimensions, allowing for modeling in each dimension
separately. Images from [14].

x1

x2

x3

x4

x5

x6

Input Layer
Latent

Representation

x̂1

x̂2

x̂3

x̂4

x̂5

x̂6

Output Layer

Figure 2.3: Basic autoencoder architecture overview. From input, the data is com-
pressed to a latent representation through a decoder. And the decoder reconstructs
the input data from the latent representation.

14


2. Theory

2.5 ARIMA
ARIMA models utilize a statistical modeling approach to find the best fit of a
time series to model future values of a times series [28]. This technique facilitates
the recognition of significant patterns and trends within the data, which can subse-
quently be leveraged to predict future values. This approach combines three different
models:

1. Autoregression (AR) - A model that uses past values of a time series to predict
future values.

2. Integrated (I) - A model that transforms a non-stationary time series (that is,
the mean, variance and autocorrelation is varying over time) into a stationary
one, by taking the difference between consecutive samples.

3. Moving Average (MA) - A model that uses the residuals of past predictions to
predict future values.

ARIMA is defined by three parameters, p, d and q, which respectively represent the
order of autoregressive, integrated and moving average parts of the model. The
time series is differenced d times, while p and q are the number of lagged (previous)
observations and residuals, respectively, to take into consideration.

The AR component can be represented as follows:

Yt = c + ϕ1 · Yt−1 + ϕ2 · Yt−2 + . . . + ϕp · Yt−p + εt (2.7)

where Yt is the observed value at time t, c is a constant term, and ϕ1, ϕ2, . . . , ϕp are
the autoregressive coefficients, which represent the weights assigned to the lagged
values. Yt−1, Yt−2, . . . , Yt−p are the lagged observations and εt is the error term
at time t. The differencing operation is defined as ∆Yt = Yt − Yt−1, performed
d times. The moving average component considers the relationship between the
current observation and the residuals from past predictions, represented as

Yt = µ + εt − θ1 · εt−1 − θ2 · εt−2 − . . . − θq · εt−q (2.8)

where µ is a constant term of the time series, εt represents the error term at time
t and θ1, θ2, . . . , θq are the moving average coefficients, which represent the weights
assigned to the lagged error terms.

Combining the AR, I, and MA components, the ARIMA model can be represented
as

Yt = c+ϕ1 ·Yt−1 +ϕ2 ·Yt−2 + . . .+ϕp ·Yt−p +εt −θ1 ·εt−1 −θ2 ·εt−2 − . . .−θq ·εt−q (2.9)

ARIMA has been widely used in various fields where time series are present and
future values are of interest. In an anomaly detection setting, ARIMA can be used

15


2. Theory

for forecasting future values of a time series. Anomalies, or unusual patterns in the
data, can be detected by comparing the forecasted values with the actual values.

2.6 Anomaly Detection
Anomalies, or outliers, refer to points or patterns in a dataset that deviate from the
expected or normal behavior. The appearance and cause of an anomaly can vary
depending on the data and application. The types of anomalies are described in [29].
These are as follows:

• Point outliers: A point outlier refers to a data point that exhibits abnormal
behavior at a specific point in time compared to either the other values in the
time series (global outlier) or its adjacent data points (local outlier). Point
outliers can be either univariate or multivariate, depending on whether they
affect one or more time-dependent variables, respectively.

• Subsequence outlier: A subsequence outlier, also called collective outlier relates
to a sequence of points whose combined behaviour is not expected. Each
individual point in the sequence outlier does not necessarily need to be a
point outlier.

• Contextual outlier: A contextual outlier is a type of anomaly that is dependent
on the context in which the data is observed. Unlike point and sequence
outliers, contextual outliers are defined based on their relationship to other
variables and the context in which they occur.

These types of outliers can be detected by various approaches, either by forecasting
or reconstructing the data. Anomaly detection models are generally unsupervised in
the sense that they are not provided with anomaly labels during training. Instead, a
sample is determined to be an anomaly based on its anomaly score, which represents
the discrepancy between the model output and the observations, in comparison to
an anomaly threshold.

2.6.1 Forecasting Based Anomaly Detection
Forecasting based anomaly detection models perform time series forecasting and
compare the predictions to the true values. If the error between the predictions
and the true value exceeds a certain threshold, then the point is considered as an
anomaly. Forecasting one point at a time typically identifies unexpected deviations
in the data constituted by point anomalies.

A forecasting based anomaly detection model is trained on normal data, that is,
data without any known anomalies. Hence, the model knows the normal behaviour
and consequently, abnormal data will not be forecasted well. By relying solely
on the forecasting error, labels of anomalies are not required during the training
phase, constituting an unsupervised model. This makes them useful for detecting
previously unseen types of anomalies, as the model can learn to recognize deviations
from normal patterns without being explicitly trained on anomalous data.

16


2. Theory

Common models used in forecasting based anomaly detection are ARIMA and
LSTM models.

2.6.2 Reconstruction Based Anomaly Detection
Reconstruction-based anomaly detection models operate by reconstructing a given
time series sequence and comparing it to the actual time series. The general idea
behind these models is that if the generated reconstruction closely matches the
input sequence, then the sequence is considered normal, otherwise it is anomalous.
To achieve this, the model is typically trained on a set of normal (non-anomalous)
time series sequences [30]. During training, the model learns to generate accurate
reconstructions of these sequences. Once trained, the model can then be used to
generate reconstructions of new input sequences. If the reconstruction error (that is,
the difference between the generated reconstruction and the input sequence) exceeds
a certain threshold, then the input sequence is flagged as anomalous [13]. Since
multiple values are reconstructed, subsequence or contextual outliers can be detected
without containing point outliers. Autoencoders combined with RNNs are commonly
used for time series reconstruction.

2.6.3 Anomaly Score Thresholds
Anomaly scores are computed by comparing the actual time series values to predicted
ones. Each data point in the test set is assigned a score, which reflects the quality of
the prediction. The lower the score the better the estimation [31]. To evaluate the
performance of a model, anomaly classifications are constructed by comparing the
anomaly scores to a threshold and classifying as anomaly if the score exceeds the
threshold. In our study, we have utilized the maximum training error, epsilon and
rolling median absolute deviation (MAD) methods to determine the threshold level
and subsequently classify the instances accordingly. The maximum training error
as threshold utilizes the fact that only normal data is present during training, thus
errors exceeding those experienced during training should be classified as anomalies.
Epsilon and MAD threshold methods require further explanation.

2.6.3.1 Epsilon Threshold Method

Suppose that we have the input data X = {x1, x2, ..., xN}, where each xi ∈ Rk is
a vector containing values for all k features. Also, suppose that we have trained
a model and predicted values X̂ = {x̂1, x̂2, ..., x̂N}. The errors ei for each feature
i ∈ 1, 2, ..., k, is then defined as

ei = |x̂ − x|i.

Computing the mean µ and standard deviation σ of the training errors for each
feature and time stamp t, each predicted value is classified as an anomaly if

ei,t ≥ µi + ϵσi, (2.10)

where ϵ ≥ 0.

17


2. Theory

Table 2.1: Classification outcomes after applying anomaly detection models.

Actual Prediction Result
Normal Normal TN
Anomaly Anomaly TP
Normal Anomaly FP
Anomaly Normal FN

2.6.3.2 Rolling MAD Threshold Method

Suppose the same scenario is available as in 2.6.3.1, the rolling MAD threshold
is obtained by studying the median errors in the training set in a sliding window
fashion. For a window size w, the threshold for each feature i is determined as

max
(
M(xi,t, xi,t+1, ..., xi,t+w)

)
, ∀t ∈ 0, 1, ..., N − w, (2.11)

where M(·) is the median and N is the number of samples.

2.6.4 Anomaly Detection Evaluation Metrics
Finding a meaningful evaluation metric when working with anomaly detection mod-
els is necessary since the distribution of normal versus abnormal points is usually
highly imbalanced [31]. Anomaly detection is an unsupervised method but based
on the reconstruction and/or forecasting error, classifications can be obtained by
comparing the error to the threshold. By comparing the classifications to the true
values, the outcomes presented in Tab. 2.1 can be obtained. Using the different
classification outcomes, the performance of a model can be determined by different
metrics.

Due to the highly imbalanced dataset it is not reasonable to use accuracy as a metric,
instead the following metrics will be explored.

Precision (P ) = TP

TP + FP

Recall (R) = TP

TP + FN

F1 = 2 P · R

P + R

(2.12)

Precision measures the proportion of correctly predicted instances out of all posi-
tively predicted instances. In this project, this corresponds to the ratio between true
alarms and the total number of alarms. High precision indicates that false alarms
are effectively reduced and enhances the reliability towards the model. Recall quan-
tifies the ratio of correctly predicted positive instances to the total number of actual
positive instances. This yields a fraction of anomalies successfully detected. High
recall indicates that the capabilities of the model towards capturing true anoma-
lies are enhanced. The F1 score offers a balanced evaluation of performance as it

18


2. Theory

takes into account both precision and recall. A high F1 score reflects the ability to
accurately capture true anomalies while also minimizing false alarms.

The Precision-Recall Curve (PRC) provides a visualization of the trade-off between
precision and recall in a classification model. The construction of the curve involves
computing precision and recall at various classification thresholds. This is achieved
by varying the thresholds, calculating precision and recall for each threshold, and
subsequently plotting the precision-recall pairs on a graph to form the PRC.

The Receiver Operating Characteristic (ROC) follows the same construction scheme
as PRC but instead visualizes the trade-off between TP rate and FP rate for the
model. By computing the Area Under the Curve (AUC) score, a measurement of
the model’s performance is provided. The score ranges from 0 to 1 where a higher
score indicate a more accurate model.

19


2. Theory

20


3
Methods

This section outlines the methodology employed for this project. The section begins
with an understanding and exploration of the data to describe its characteristics
and patterns. In order to optimize the data for modeling, various steps are con-
sidered, including data cleaning, feature extraction, selection, and transformation.
Several models, such as a simplistic baseline model, MTAD-GAT, Autoencoder, and
ARIMA, are described and compared in terms of their applicability and performance.
Lastly, potential pitfalls in the data and modeling processes are identified and dis-
cussed for full transparency regarding the robustness and reliability of our findings.

3.1 Data Understanding
In this section a brief introduction to the datasets is presented. We describe how
the data is extracted and key properties of the datasets.

3.1.1 Data Description
The signals from production machines and their integrated sensors are stored in a
data historian, optimizing its efficiency by only logging values when a change of
certain magnitude occurs. Since this happens at various rates for different variables,
the continuous variables are interpolated between timestamps while the discrete
variables keep their value until change to obtain the data on similar frequency. The
data is extracted from two machines of the same type, denoted machine 1 and 2,
respectively. The motivation for the chosen machines is recently installed vibration
sensors, enabling similar data collection from both machines. Also, using machines
of the same type serves the purpose of investigating generalization possibilities, since
they share the same set of tools, enabling training on one machine and testing on
the other.

The data is collected at a frequency of one observation per second, covering the time
period from 2022-05-17 to 2023-03-14, corresponding to the time of the installation
of the vibration sensors until the data collection phase of this thesis. During the
extraction stage, data pertaining to periods of production downtime is filtered out.
For both machines, the same subset of signals is extracted. These signals contain
vibrations and machine signals such as material ID, production signal etc. An
excerpt of the signals, together with a short description can be found in Tab. 3.1.

21


3. Methods

Table 3.1: Important signals: name and description.

Signal name Description [unit]

MDC_MaterialID Material ID [-]
MDC_ProductionSignal Production signal [-]
MDC_ToolNbr1 Current tool in spindle [-]
PDA_Actual_Spindel_RPM Rotational speed of the spindle [rpm]
Object_1_value Vibration x-joint [milligravity]
Object_2_value Vibration y-joint [milligravity]
Object_3_value Vibration z-joint [milligravity]
MDC_AlarmID1-5 Machine alarm. 3190 if tool malfunction. [-]

The signals in their entirety are presented in Tab. A.1 in Appendix A. According
to literature and expert knowledge at the company the vibrational data are the key
signals.

3.1.2 Data Exploration
In total, 21 columns are extracted from both machines but the number of rows
differs due to individual production stops.

Exploring the data on tool level, one observation reveals that for machine 1 there
are 31 unique tools containing at least one tool breakage. The same number for
machine 2 is 29. In Fig. 3.1 the number of malfunctions for each tool for the
machines, respectively, are displayed. Tools with many malfunctions, either for a
single machine or in combination, are: 2112, 2345, 4304, 4503 and 4507. These
tools will from here on form a subset on which the experiments are performed.
Analyzing the number of data points for each tool, in comparison with the number
of malfunctions, seen in Tab 3.2, reveals a scenario of highly imbalanced data sets.
Since the data set for machine 2 and tool 4503 has the most number of occurrences
of malfunctions, it is from here on denoted D1 and will be used when comparing
models. Another takeaway is that the usage of the tools seems similar for both
machines, since the total data points are of comparable magnitude, for a given tool.

To get an understanding of the vibrational behaviour around the malfunctions for
D1, Fig. 3.2 and Fig. 3.3 visualize different patterns. In Fig. 3.2, there is a visually
clear peak in close relation to the malfunction detection. This spike could illustrate
the actual malfunction event or an event leading up to the malfunction. On the other
hand, Fig. 3.3 displays no clear oddity before the tool breakage. This indicates that
detecting such tool breakages needs to be achieved with other methods than finding
point outliers. In total, 7 of the 17 malfunctions in D1 are preceded of a high
spike, similar to the first figure. The remaining 10 are of different character, not
preceded by abnormally large values. These characteristics are also verified when
looking at a visual summary of the data in Fig. 3.4, where we see that the overall
distribution remains consistent between the entire dataset and data closely related
to malfunctions, in which both the most extreme outliers and normal behaviour is

22


3. Methods

(a) Machine 1.

(b) Machine 2.

Figure 3.1: Number of malfunctions per tool and machine.

Table 3.2: The number of malfunctions for each tool and machine, together with
the total number of data points.

Tool Machine Malfunctions Total data points

2112 1 13 49 158
2112 2 12 26 955
2345 1 10 110 523
2345 2 7 99 544
4304 1 7 8 534
4304 2 6 5 633
4503 1 6 60 608
4503 2 17 67 664
4507 1 10 124 583
4507 2 6 115 662

23


3. Methods

Table 3.3: The enumerated sequences preceded by one or multiple abnormal values.

Preceded by extreme values Sequences

YES 4, 6, 7, 8, 11, 14, 15
NO 0, 1, 2, 3, 5, 9, 10, 12, 13, 16

Figure 3.2: Time series data around a malfunction, with an abnormal value prior.

represented. Tab. 3.3 enumerates which characteristic each malfunction expresses.
To simplify referencing, sequences in the Yes-row are referred to as group 1 while
the remaining constitute group 2.

3.2 Technique Exploration
In order to be able to use a regression model predicting RUL, a target variable needs
to be created with the true remaining life for each input. Thus, it is necessary to
meet certain data requirements. Specifically, accurate data is required for three main
aspects: firstly, it is required to have indications for which of the master and the
sibling tool is being used. Secondly, we need indications for when a tool is replaced.
Lastly, the theoretical remaining life should be present in order to evaluate the
performance of the model.

To investigate the feasibility of this information, relevant features are studied. In-
vestigations are performed when the machine is in production, for a selected tool
and a contiguous sequence of data points.

For an anomaly detection approach, high-quality data is necessary. Lack of quality
can be estimated by analyzing the input data for erroneous entries, such as missing
or faulty signals.

24


3. Methods

Figure 3.3: Time series data around a malfunction, without outlier.

(a) Boxplot of the vibration signals on full dataset.

(b) Boxplot of the vibration signals in close relation to
malfunctions.

Figure 3.4: Boxplots for the vibration signals in the full dataset and in close relation
to malfunctions.

25


3. Methods

3.3 Data Preprocessing and Feature Engineering
To effectively use the extracted data in a modeling setting, multiple preprocessing
steps need to be applied. Pandas library in Python is a useful tool when working with
large datasets. The preprocessing steps include data cleaning, feature extraction,
feature scaling and feature selection. A description of these steps is provided in this
section.

3.3.1 Data Cleaning
From the company database, only data during production or when an alarm indi-
cates a malfunction is extracted. Also, data not corresponding to the five tools 2112,
2345, 4304, 4503 and 4507 are removed. Additionally, cleaning includes manipulat-
ing or removing faulty or uninformative data. Any row containing NaN value for
the vibrational data is removed, corresponding to logs prior to sensor installation.
The vibrational data is the core data and thus accuracy of such is important, hence
the missing values are not replaced with any imputed value, such as the mean or
median. However, for other features, missing or faulty data are imputed with ap-
propriate values. For instance, a missing material ID is replaced with -1, since it is
a categorical variable.

Due to the characteristics of the data extraction, coherent timestamps after a mal-
function hold the same information as the first row when the malfunction is detected
until the tool is replaced and production restarted. Thus, such rows are removed.

Additional cleaning is performed after the train and test split is performed. Since
any anomaly detection model should be trained on normal data, it is important
that the training set does not contain faulty values. Thus, extremes and periods
with missing data logs, resulting in an almost constant interval, denoted bridge, are
removed. An example is seen in Fig 3.5 where both extremes and a bridge are
removed, displaying the difference between the original and cleaned data.

3.3.2 Feature Extraction
New features are created based on the original data to simplify relevant information
extraction from the dataset. The column ToolMalfunction is created and has
value 1 if any of the five alarm IDs holds the value 3190, else it is 0. Also, a
feature NewSequence is derived based on the time gap between two consecutive
rows when isolating the data on tool level. It is 1 if the gap exceeds one second,
else 0. The reason is due to the condition check after each use of a tool. A larger
time gap than one second would imply that another tool has been used in between
or that the production was idle. Theoretically, both cases would infer that laser
measurements have been performed and the condition of the tool is known. Thus,
when a malfunction is detected, the interval where the actual breakage can have
happened is limited to the time between the latest new sequence and the detection
of the malfunction.

To evaluate anomalies detected, a column AnomalySequence is constructed with

26


3. Methods

Figure 3.5: Example of how the two cleaning steps can be applied to the data,
removing extreme values and bridges.

the value 1 for a sequence containing a malfunction, that is all rows between an
indication of a malfunction and its corresponding new sequence. Otherwise the
value is 0.

3.3.3 Feature Selection

To exploit the information stored in the data in the best way it is necessary to
understand dependencies and relations between features. To reduce memory alloca-
tion and achieve computational speed-up, a subset of the features are selected. The
key signals are the vibrational data, which are to be reconstructed or forecasted.
In addition, features that might influence the behaviour of the vibrations are also
extracted. By visualizations, material ID is observed to infer a change in behaviour
characteristics of the vibrations. Fig. 3.6 displays an example of the mentioned
relation. Thus, material ID is added to the selection of features. Also, the spindle
RPM is selected since it determines the strain of the tool.

3.3.4 Data Transformation

The selected features are transformed in appropriate ways to suit a machine learning
model. Each feature except material ID is standardized, according to x̃i = xi−µi

σi
,

where µi and σi are the mean and standard deviation of feature i.

For material ID, scaling the values is not adequate since the information does not lie
in its magnitude. Material ID is a categorical variable, where each discrete value is
unrelated to the other values. Thus, a binary representation is created using one-hot
encoding.

27


3. Methods

Figure 3.6: A sample sequence showcasing the impact Material ID has on the vibra-
tion signals.

3.4 Model Comparisons
The models that will be evaluated are a baseline model, an ARIMA model, an au-
toencoder, and MTAD-GAT. To compare the performance between different models,
a general setting is needed to proceed from. In this section, the general setting is
described, followed by the specific settings for each model.

3.4.1 General Description
The comparison between models is conducted on the data from a single machine and
tool, specifically machine 2 tool 4503, since it has the largest record of tool break-
ages. The behaviour around the malfunctions is of different characters, displayed
previously in Fig. 3.2 and 3.3. For this specific tool, 7 malfunctions are preceded by
a high spike, similar to the first figure. The remaining 10 are of a different character,
not preceded by an extremely large value. Thus, it is determined to be the best
subset of the data to perform the comparison on, exposing the models to multiple
malfunction scenarios. Before comparing between models, the optimal settings for
each model are determined based on the mentioned dataset.

In general, the data is split into training and testing sets, with 50% of the data in
each. The models are trained on a normal representation of the data. To achieve this,
the data is split in such a fashion that the training data contains as few malfunctions
as possible, still being coherent. Even though this can create a discontinuity in the
test set, this is an efficient way to keep as many malfunctions as possible in the test
set while minimizing the number of discontinuities in both sets. Any malfunctions
still within the training set are removed together with the previous 300 time steps,
to be sure of not keeping any data point related to a malfunction. On average, the
sequence length for the selected data set is 20 seconds. This means that this tool is

28


3. Methods

being used on average 20 seconds at a time. The testing data is transformed based
on the parameters obtained from transforming the training set.

For univariate models, the vibrational data in x-, y- and z-joint are used, respectively,
as input and output. For multivariate models, all features are used as input features
simultaneously, outputting the vibrations mentioned above. For each output feature
and time step, an anomaly score is computed, as the difference between the output
and the true value. In a reconstruction and forecasting scenario, the sum of the
errors constitutes the score. Different approaches are investigated, one where the
mean of the errors for all features within each time step constitutes a global anomaly
score and another where the anomaly scores are kept separated for each feature. The
errors are compared to a threshold, specific for the selected threshold method, and
labeled as an anomaly if exceeding the threshold. The labels for the individual
features can be aggregated by an AND or OR operation. For the AND operation,
only time steps where the score of each feature exceed the threshold will be labeled
as an anomaly. For an OR operation, it is sufficient if only one feature is classified
as anomaly to classify the sample as anomalous.

The true labels are constituted of the anomaly sequences, where all instances within
each sequence are labeled as anomalous. In addition to precision, recall, and F1
scores, the heads-up, that is, the average time between the first detected instance
in an anomaly sequence and the time of actual tool malfunction detection, are
measured.

For each model, we first compare the general capabilities in ROC and PRC plots.
This will indicate how the models perform in terms of classification accuracy and
precision-recall trade-off. The area under the ROC curve will provide insights into
the overall ability to discriminate between positive and negative samples, while the
area under the PRC curve will indicate the performance in terms of precision and
recall balance. Thresholds regarding the pointwise global anomaly scores and a
rolling median approach, MAD, are investigated. Then, we look at the most promis-
ing models and the baseline model, evaluating them in the use-case of detecting
anomaly sequences in production. For this case it is sufficient to flag one time step
in a sequence as anomalous to accept the anomaly sequence as detected. Hence, in
such a scenario a TP reflects a detected sequence, not an instance. The potential of
each model is compared by evaluating different metrics. Primarily, a high F1 score
is wanted, but also a model able to find many anomaly sequences (high recall) with
reasonably few FPs is investigated to study the behaviour of FPs in an attempt to
increase the precision.

3.4.2 Baseline Model

The baseline model is created as a univariate model that identifies all values in the
test set that exceeds the largest training value as anomalies.

29


3. Methods

Figure 3.7: MTAD-GAT architecture, image from [14].

3.4.3 MTAD-GAT
MTAD-GAT is a deep learning-based framework for anomaly detection in multivari-
ate time-series data, with the aim of both reconstructing and forecasting the input
data [14]. The implementation of the model was obtained from an open source
GitHub repository1, with MIT license. This license grants the freedom to use, mod-
ify, and distribute the code for commercial purposes, while ensuring that the original
copyright notices and disclaimers are included in any copies or modifications of the
software.

Comparing to the original model architecture, the VAE is removed and instead the
GRU layer is regarded as an encoder layer and an RNN decoder layer is added to
reconstruct the data. Also, modifications are made to enable feeding the model with
all features but only using the subset of output-features as targets for forecasting
and reconstructing.

The multivariate time series data is transformed into a sliding window dataset to
be fed into the model. Specifically, overlapping sub-sequences of fixed-length are
created by splitting the data according to the sliding window with stride one. Each
sub-sequence constitutes a sample to use during training or testing. By doing so,
temporal dependencies within each window could be captured. The size of the
sliding window is determined empirically.

The transformation is made using utilities in PyTorch using the Dataset and
DataLoader classes. First, the time series data is loaded into a PyTorch tensor.
Then, we use the Dataset class to create a dataset of sliding windows. We use the
DataLoader class to create an iterator over the dataset, which allows to efficiently
load and process the data in batches during training.

The model expects the data to be on sliding window form, with shape (batch size,
window size, number of features). Going from left to right, a filter convolves over
the input data layer filtering out noisy and irrelevant information. Then in parallel,
the input data is transformed to a time oriented and feature oriented graph respec-
tively, where the nodes in the graph represent data points and the edges represent
relationships or connections between the data points. The resulting dependencies

1https://github.com/ML4ITS/mtad-gat-pytorch

30


3. Methods

Figure 3.8: A symmetrical LSTM autoencoder architecture with 3 LSTM layers
with sizes 64, 32 and 16.

are concatenated together with regular input data and fed into a GRU layer to cap-
ture long term dependencies. The output from the long term dependencies are fed
into forecasting and reconstruction based modules in parallel. The reconstruction
module recreates the data to the same shape as the input data, while the forecasting
module predicts future values. Separate losses are computed for the reconstruction
and forecasting modules in the model. The model is optimized using a joint loss
function, enabling simultaneous learning of both reconstruction of the past and fore-
casting of the future. The selection of hyperparameters is visible in Tab. B.1, in
Appendix B. With this selection of hyperparameters, the architecture has around
400 000 trainable weights.

3.4.4 LSTM Autoencoder
The autoencoder model is a reconstruction-based model and takes the same input
and output features as MTAD-GAT. The shape of the input and output data also
corresponds to the same shape described previously. In order to compare the per-
formance of the models incorporating reconstruction, the same window size for the
autoencoder is used as for MTAD-GAT. The autoencoder is implemented in Keras,
consisting of an encoder with 3 LSTM layers, with sizes 64, 32 and 16. The bot-
tleneck is represented by a RepeatVector layer. This layer repeats the output of
the previous layer (i.e., the last LSTM layer in the encoder) multiple times, so that
the resulting tensor has the same shape as the expected input of the decoder. The
decoder also consists of 3 LSTM layers, with sizes 16, 32 and 64. An overview of
the architecture can be seen in Fig. 3.8.

The model is trained using the Adam optimizer with a learning rate of 0.001. The
training process lasts for 200 epochs. Early stopping is implemented, which means
that the training will stop if the weights have not improved for 25 consecutive
epochs. In such cases, the best weights achieved so far will be saved as the final
model. Furthermore, this architecture has roughly 80 000 learnable parameters.

31


3. Methods

3.4.5 ARIMA
To build the ARIMA model, the Python packages pdmarima and statsmodels are
used.

One model per feature is fitted using the training data. The parameters (p, d, q) are
empirically found to be (1, 0, 1), (2, 0, 1) and (3, 0, 3) for each feature, respectively.
For each model, a rolling forecasting approach is utilized, meaning that for each
time step in the test set, the value of the next time step is forecasted and the true
value is appended to the model. This process is repeated iteratively until all the
values in the test set have been predicted.

3.5 Evaluation
After comparing the performance of the different models, we select the best perform-
ing one to further investigate its generalization capabilities. Specifically, we train
the selected model on data from one tool and evaluate its performance on data from
another tool or machine. This shows the ability of the model to detect anomalies
across different equipment and operating conditions.

The results obtained from this analysis will be evaluated in the context of the current
workflow and decision-making processes of the company. For example, we will assess
the potential benefits of using the proposed model to detect malfunctions earlier than
the current approach and how this may translate into improved equipment reliability,
reduced downtime, and cost savings.

3.6 Potential Pitfalls in the Data
When working with data from a real-world scenario, it is important to be aware of
potential pitfalls that may affect the performance of the models. For this project,
pitfalls can be related to the available data or the interpretation and implementation
of production procedures.

One potential source of error is related to the material ID. As shown in Fig. 3.6,
material ID affects vibrations. However, analyzing the training and test sets, only
between 23% to 63% of material IDs within the test sets have been seen during train-
ing. For unseen material ID, the models can not utilize any relation between that
material and the vibrations, possibly having a negative impact on the performance.

Another potential issue is if the data contains inaccuracies such as outliers or bridges
that are not related to malfunctions. Mentioned data are not removed from the test
set since the test data should reflect a real-world scenario and removing it would
falsify such an objective. However, such data will reasonably raise false alarms,
affecting the results of the model.

Regarding the interpretation of production procedures, the actual break point for a
tool should only be able to occur within the last sequence before a detected malfunc-
tion. However, suspicious behaviour can be seen in Fig. 3.9 where a vibration signal

32


3. Methods

Figure 3.9: A sample from a vibration signal acting suspicious a couple of sequences
before the malfunction was found. The dashed lines indicates a new sequence.

has an abnormally high value a couple of running cycles before the actual malfunc-
tion was found. If this behaviour in fact is related to the upcoming breakage, the
prerequisites change.

33


3. Methods

34


4
Results

In this chapter, the results regarding the prospects of each technique, metrics out-
putted from the models and differences between the current workflow and a scenario
where the model is used, are presented.

4.1 Comparison of Approaches
In this section the analysis of which approach should be used to try to predict
malfunctions is presented. To build a regression model and predict RUL, labels
of the true remaining life are needed. For anomaly detection, high quality data is
required.

4.1.1 Predicting RUL
The labeled data needed for a supervised scenario should be the true remaining life,
that is the time until malfunction, for any tool at any given point. Such data does
not exist and needs to be manufactured. To achieve this, data of the exact time of
malfunction, when a new tool is inserted, and which of the master and sibling tool
is operating are needed. For evaluation purposes, values of the current estimate of
RUL are wanted. At our disposal we have the features PDA_Tool_Life_Counter,
PDA_Maximum_Tool_Life and PDA_Remaining_Tool_Life, from here on denoted tlc,
tml and trl, respectively.

The tml holds the expected life length, assigned for each tool. The tlc counts the
time that a tool has been used in production. The difference between tml and
tlc constitutes trl, All values are presented in seconds. From discussions with the
company, it is assumed that a new tool should be indicated by a reset of tlc, that is
tlc = 0. Also, switching between the sibling and master tool should be indicated by
a discontinuity, from where the life counter then increases from. By starting from
each malfunction and calculating backward until the insertion point of that tool is
encountered, taking any switches between tools into consideration, the true RUL
can be obtained.

However, this approach is infeasible. To start with, the exact time of malfunction
is unknown. Only the time for detecting the malfunction is collected, hence only
approximations of the true remaining life can be manufactured. In addition, the tlc

is too unpredictable to even make useful approximations. It does not incorporate

35


4. Results

Figure 4.1: The life counter for a contiguous period when the machine is in produc-
tion, for a specific tool. The left figure portrays the extremely large values and the
right figure demonstrates temporary resets of the counter.

Table 4.1: Frequency for different values of maximum life time for the chosen, con-
tiguous sequence of 1000 points.

Value Frequency

1800 965
0 15

9000 12
18000000 8

the expected, piece-wise linearity and tml is not constant, for a specific tool. To
demonstrate this, data from an active production, sorted on tool level, is extracted.

For a representative, contiguous sequence of 1000 seconds, tlc contains multiple
abnormally high values, displayed in Fig. 4.1. These high values could correspond
to faulty data recordings. During the selected sequence, the counter is also zero
multiple times, which would indicate a new tool. However, Fig. 4.1 proves that
the counter is only temporarily reset, since the following values correspond to those
prior the reset. By the construction of the signal, this infers that invalid data has
occurred. Even removing such data would not result in the expected behaviour,
since tlc only changes when the tool is actively operating on the material and not
when production signal is high.

Despite the expectation of a constant tml, it shows unexpected fluctuations and
its frequency is presented in Tab. 4.1. According to expert knowledge, the true
value should be 1800 and the remaining values could be due to unsynchronized data
recordings, inheriting the values from preceding tools, or manual overrides performed
by operators.

Following from the demonstrated behaviour of tlc and tml, trl has equal flaws. Alto-
gether, manufacturing labels of high quality is infeasible, disqualifying the possibility
of RUL prediction.

36


4. Results

(a) Learning rates. (b) Weight decays.

Figure 4.2: Validation loss for MTAD-GAT trained on tool 4503 and machine 2 for
different learning rates and weight decays.

4.1.2 Anomaly Detection
The requirements to successfully implement an anomaly detection model are not
as extensive as for a model predicting RUL. The necessary time-series data exists
in terms of vibrational data. This data should be reliable and complete. After
processing the data as described in chapter 3, this is assumed to be fulfilled. Data
regarding the actual life of a tool and the estimated remaining life is not necessary.
A true label of anomalies eases the evaluation and can be constructed by utilizing
the stored points when a tool malfunction is discovered together with the created
feature NewSequence, as described in 3.3.2. By acting accordingly, all requirements
for using anomaly detection are fulfilled.

4.2 Model Comparison
In this section, the performance of MTAD-GAT, Autoencoder, ARIMA and a base-
line model in detecting anomalies will be compared on D1. After tuning parameters,
we first compare the general capabilities of the models in ROC and PRC plots. Then
the most promising models are evaluated from a use-case scenario.

4.2.1 Hyperparameter Tuning
For MTAD-GAT, empirical studies are performed to determine the values for learn-
ing rate, weight decay and window size. The loss is compared for different parameter
values and generally the value resulting in the lowest loss is chosen. The learning
rate and weight decay parameters are set to 0.001 and 0, respectively, based on
the findings displayed in Fig. 4.2. For different window sizes the reconstruction
and forecast loss are displayed individually. Using a smaller window size does not
influence the forecast ability, illustrated by Fig. 4.3a. However, using fewer points
during reconstruction reasonably generates a lower loss, verified in Fig. 4.3b. Even
though a window size of 5 provides an overall lower loss, the results for window size
10 are also included to avoid filtering out any distributional properties detected with
a larger window size.

37


4. Results

(a) Forecast loss. (b) Reconstruction loss.

Figure 4.3: Forecast and reconstruction loss, respectively, for MTAD-GAT trained
on tool 4503 and machine 2 for different window sizes.

4.2.2 ROC and PRC
For ROC and PRC plots, the global anomaly scores and rolling MAD scores are
evaluated respectively. Each point is classified as positive if its error exceeds the
threshold, regardless of whether other points within the same anomaly sequence
have been detected or not. In total, there are 311 points corresponding to the 17
anomaly sequences. For ROC, this provides a general investigation of how accurately
the models can separate between the points corresponding to the anomaly sequences
and normal points. Observe that this approach implies an assumption that all points
within the anomaly sequence are anomalous in order to enable detection.

Fig. 4.4 displays the results for both global errors and MAD errors. Closer inspec-
tion of 4.4a indicates that autoencoder10 is most profound in separating the classes
for the global errors, even though the performance between all MTAD-GAT and
autoencoder models are comparable. The results indicate that a fairly high share of
TPs can be detected at a low FP rate, using these models. The baseline, comparing
the actual vibration values to the maximum value seen in training, is outperformed
by all other models, indicating that solely relying on large vibrational values is not
sufficient. Focusing on the ROC for MAD errors, displayed in Fig. 4.4b, the same
models are eminent, with an increased AUC score for all models except the baseline.
This implies that forecasts and reconstructions during anomaly sequences are gener-
ally exposed to higher distributional errors. On the other hand, the median values
of the vibrations are not explicitly increasing during anomaly sequences, as proved
by the reduced AUC score for the baseline. However, it should be noted that the
good performance of the ROC plots should be attributed to the highly imbalanced
nature of the dataset, proved by the PRC plots.

The PRC plots, visualized in Fig 4.5, show the trade-off between precision and recall
for the same errors as in ROC. Here the AUC scores are clearly reduced, proving
the effect the imbalanced dataset provides. Since the number of negative (normal)
samples are much larger than the positive ones (anomalous), even a small fraction
of FPss will heavily punish the precision rate since there are so few TPs. This is
illustrated in both PRC cases, for global anomaly scores (Fig. 4.5a) and MAD errors
(Fig. 4.5b), where both precision rates approach zero for a low recall score. Overall,
autoencoder10 demonstrates the best performance amongst the models. It yields a

38


4. Results

(a) For global anomaly score.

(b) For MAD errors.

Figure 4.4: ROC for all models. The subscripts in the legends indicate the window
size for reconstruction.

39


4. Results

higher precision rate for recall scores up until approximately 0.4. However, from the
figure it is impossible to validate which sequences are detected.

ROC and PRC provide general transparency to the models’ capability, but it is also
interesting to evaluate the models from a use-case perspective, analyzing how well
and which anomaly sequences are detected by the models.

4.2.3 Use-case Scenario
From a use-case perspective, the number of detected anomaly sequences are of in-
terest rather than individual instances. In Tab. 4.2 a comparison between MTAD-
GAT5, autoencoder10 and the baseline is presented. The settings for each threshold
method are distinguished by the subscripts. A threshold with subscript feat classi-
fies the instances for each feature, aggregating the final classification by an AND or
OR operation. The global subscript indicates that the classifications are obtained by
using the chosen threshold method on the global anomaly score. For the ϵ-method,
the subscripted number denotes how many standard deviations are being used. The
table also presents an enumeration of the detected sequences, to evaluate which type
of sequences can be detected, in comparison with group 1 and 2.

For the thresholds yielding the highest F1 score for each model, respectively, both
the baseline and MTAD-GAT only detect sequences in group 1. The heads-up score
is the same across both models, indicating that both probably detect the same
values within the anomaly sequences. MTAD-GAT avoids some of the FPs that the
baseline classifies, but in general the performance is comparable. The autoencoder10
using maxglobal obtains the highest F1 score, managing to detect 6 anomaly sequences
from group 2, in addition to all in group 1. This could indicate that there are large
enough distributional errors within anomaly sequences belonging to group 2 that
the model is capable of detecting.

Lowering the threshold to detect all sequences results generally in an increment in
FPs. Also, such an action could incorrectly classify instances in the training set as
anomalies. The baseline has to use 1 standard deviation, while the same number
for MTAD-GAT and the autoencoder are 6 and 9, respectively. The low threshold
needed for the baseline proves that the anomaly sequences in group 2 does not
contain any relatively large values. As a result, the baseline misclassifies thousands
of instances. The higher ϵ-values for MTAD-GAT and the autoencoder informs us
that larger errors within all anomaly sequences exist. However, in terms of FPs,
the autoencoder outperforms MTAD-GAT, classifying only 93 points wrongly as
positive. However, using this threshold yields multiple FPs in the training data,
disqualifying it from further investigation.

Using MADglobal as threshold ensures that no FP within the training data is detected.
In the test set, 12 sequences are detected together with 45 FPs. Worth noticing is
that sequence 8, belonging to group 1, is not detected. This could indicate that
sequence 8 consists of short, if any, sequences of succeeding unexpected values. Like
the performance using maxglobal, this threshold can successfully identify multiple
anomalies belonging to group 2 without an explosion of FPs.

40


4. Results

(a) For global anomaly scores.

(b) For MAD errors.

Figure 4.5: PRC for all models. The subscript in the models’ name indicates the
window size for reconstruction.

Table 4.2: Model evaluation in a use-case scenario, for different thresholds.

Model Threshold Agg. operation TP FP Precision Recall F1 Heads-up Detected seq.

Baseline maxglobal - 7 5 0.58 0.41 0.48 8.57 4, 6-8, 11, 14, 15
ε1,global - 17 2880 0.01 1.00 0.01 12.00 0-16

MTAD-GAT5

maxfeat AND 7 2 0.78 0.41 0.54 8.57 4, 6-8, 11, 14, 15
ε6,global - 17 630 0.03 1.00 0.05 13.65 0-16

Autoencoder10

maxglobal - 13 10 0.57 0.76 0.65 9.38 0, 4-15
ε9,global - 17 93 0.15 1.0 0.27 9.88 0-16

MADglobal - 12 45 0.21 0.71 0.32 4.67 0, 4-7, 9-15

41


4. Results

Figure 4.6: True positive classification.

Visualizing the reconstructions for cases where TP, FN and FP occur with the au-
toencoder can further explain performance and areas of improvement. In Fig. 4.6,
the reconstructed values and the true values, along with the corresponding error
between them, are presented. The yellow rectangle represents the true anomaly
sequence, while the blue rectangle indicates predicted anomalies. Hence, overlap-
ping rectangles constitute a detected anomaly. The rectangle areas are extended
by 10 points on both sides for visual clarity. It is evident that the autoencoder
accurately reconstructs the data, and deviations from the expected pattern result
in higher errors yielding positive predictions. However, some of the sequences are
not detected, especially sequences belonging to group 2. The reconstruction errors
of such sequences are visualized in Fig. 4.7. Despite not surpassing the threshold,
the errors for some of the sequences are noticeably different from the rest.

When depicting FP classifications, there are cases where these instances occur re-
lated to a true anomaly sequence. Examples can be seen in Fig. 4.8. Such behaviour
can be found in both MAD and pointwise errors, raising the cause for further FP
analysis in the use-case scenario.

4.2.4 False Positive Analysis
In the use-case scenario a collective approach is used, treating every cluster