Applying software engineering principles to develop parcel delay forecasting models using tracking data
A study of the models ARIMA, BSTS, and GAM

Master's thesis in Computer Science and Engineering

ALEX TANG
JOONAS HALMKRONA

Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
Gothenburg, Sweden 2022

© ALEX TANG, JOONAS HALMKRONA 2022.

Supervisor: Richard Torkar, Computer Science and Engineering
Advisor: Daniel, Company X
Examiner: Robert Feldt, Computer Science and Engineering

Master's Thesis 2022
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Cover: Components of the fitted BSTS model over region DEA with service type DHL Parcel Connect.

Typeset in LaTeX
Gothenburg, Sweden 2022

Abstract

The fast growth of data in the supply chain industry, combined with the development of more sophisticated data science tools, has expanded the opportunities for actors in this industry to leverage their data. At the same time, industries need to apply best practices for data science programming to ensure software quality. In line with the digitalization of the supply chain industry, this thesis explores the possibilities of utilizing tracking-event data in parcel delay forecasting, that is, forecasting the share of late packages in various geographical regions. Three different models have been applied for time series analysis, and both the predictive ability of the models and the significance of the tracking data are studied. The three models studied in this thesis are auto-regressive integrated moving average (ARIMA) models, Bayesian structural time series (BSTS), and generalized additive models (GAM). During the development of the models, various tools and techniques were applied to follow software engineering principles. These tools and techniques were compared to those currently used at Company X. The research was conducted in collaboration with Company X as a field study that aims both to evaluate the studied models and to compare practices performed in industry with practices suggested by formal theory research regarding software engineering principles in data science programming. Results show that, out of the three models, the best performing model is BSTS and the second best is GAM. This is expected, since the ARIMA model is, in essence, a sub-model of the two other classes of models. Although the models have different forecasting capabilities, the quality of the predictions depends on the region's inherent variance in the target variable.
The results also show that there is not a very clear correlation between the features created from the tracking-event data and the target variable, although some features stand out more than others. The mixed effects between variables also seem to have greater predictive ability than the independent contributions of the individual variables. Furthermore, the results show that Company X makes a distinction between the phases of data exploration and software delivery in data science programming. This distinction was not found in the formal theory research. Although the practices found in the formal theory research are followed by Company X, creating a distinction between the two phases allows different practices to be followed at different stages of a data science project. To ensure that software engineering principles are followed during data science programming, Company X needs to separate data exploration from software delivery and create a clear framework for which tools and techniques should be used at each stage.

Keywords: Time series, ARIMA, BSTS, GAM, software engineering, software engineering principles, supply chain.

Acknowledgements

We would first and foremost like to thank our supervisor Richard Torkar for his guidance and lasting support throughout the thesis. Further, we would like to extend our sincerest gratitude to our company supervisor Daniel, who has been of great help during the thesis, providing support and guidance throughout the project. Thank you for your patience, your insights, and your knowledge. Vincent, Rigon, and Viktor, thank you for helping us during the project and giving us insight and inspiration throughout it.

Alex Tang & Joonas Halmkrona, Gothenburg, June 2022

Contents

List of Figures
List of Tables
1 Introduction
  1.1 Statement of the problem
  1.2 Research question
  1.3 Project case
  1.4 Technical environment
2 Theory
  2.1 Delivery time estimation
  2.2 Time Series
    2.2.1 Stationarity
  2.3 Auto regressive integrated moving average models
  2.4 Bayesian structural time series models
    2.4.1 Structural time series
    2.4.2 Spike-and-slab
    2.4.3 Bayesian model averaging
  2.5 Generalized additive models
  2.6 Application areas for ARIMA, BSTS, and GAM
  2.7 Software engineering principles
    2.7.1 Strategic engineering areas for data science programming
    2.7.2 The need for software engineering principles in data science programming
    2.7.3 Software engineering principles in data science programming
3 Methods
  3.1 Research procedure
  3.2 Model development methods
    3.2.1 ARIMA
    3.2.2 BSTS
    3.2.3 GAM
  3.3 Feature engineering
    3.3.1 Missing time series data
    3.3.2 Missing data statistics
  3.4 Model evaluation methods
  3.5 Statistical Measures
    3.5.1 Root mean squared error
    3.5.2 Mean absolute error
    3.5.3 Mean absolute scaled error
  3.6 Implementing a software engineering framework
    3.6.1 Understanding the current data science development process
4 Results
  4.1 Time series models
    4.1.1 Models in Region DE1 with service type A
    4.1.2 Models in region DEA with service type A
    4.1.3 Models in Region DE7 with service type B
  4.2 Model evaluation
  4.3 Feature importance
  4.4 Applying software engineering principles in data science programming
    4.4.1 Created software
  4.5 Software engineering principles followed in this thesis when developing the data science models
    4.5.1 Consistent coding
    4.5.2 Automate task performances
    4.5.3 Incremental coding
    4.5.4 Avoiding repetition
    4.5.5 Testing Notebooks
    4.5.6 Software Optimization
    4.5.7 Software Documentation
    4.5.8 Software Collaboration
  4.6 Software engineering principles followed by data scientists at Company X
    4.6.1 Background of the interview subjects
    4.6.2 Consistent Coding
    4.6.3 Automate Task Performances
    4.6.4 Incremental Coding
    4.6.5 Avoiding Repetition
    4.6.6 Testing Notebooks
    4.6.7 Software Optimization
    4.6.8 Software Documentation
    4.6.9 Software collaboration
5 Discussion
  5.1 Time series models
    5.1.1 Generalizability of findings
    5.1.2 Feature engineering and importance scores
    5.1.3 Utility of modeling results for Company X
    5.1.4 Future research in the modeling domain
  5.2 Software engineering principles in data science programming
    5.2.1 Software engineering principles followed by Company X
    5.2.2 Data exploration and final software delivery
    5.2.3 How Company X applies software engineering principles compared to the thesis's development process
    5.2.4 Recommendations for Company X
    5.2.5 Future research for software engineering principles in data science programming
  5.3 Threats to validity
6 Conclusion
Bibliography
A Interview questions

List of Figures

1.1 Most common chain of events.
3.1 The conducted research and its methodology steps in correlation with the eight research strategies for software engineering by Stol & Fitzgerald (2018).
3.2 The two parallel processes of data science and software engineering conducted during this thesis.
3.3 NUTS1 regions of Germany.
3.4 Traditional train and test split.
3.5 Train and test split for cross validation.
3.6 Comparing tools and techniques used by us with what is used by data scientists at Company X.
4.1 ARIMA model results for region DE1 with service type A.
4.2 BSTS model results for region DE1 with service type A.
4.3 GAM model results for region DE1 with service type A.
4.4 ARIMA model results for region DEA with service type A.
4.5 BSTS model results for region DEA with service type A.
4.6 Components of the fitted BSTS model over region DEA with service type A.
4.7 GAM model results for region DEA with service type A.
4.8 ARIMA model results for region DE7 with service type B.
4.9 BSTS model results for region DE7 with service type B.
4.10 GAM model results for region DE7 with service type B.
4.11 Flowchart of created software.

List of Tables

3.1 Shipment information table.
3.2 Event tracking information table columns.
3.3 Regression variables with corresponding description.
3.4 Share of missing time-series data.
4.1 Table containing the mean share of late packages for service type B.
4.2 Table containing the mean share of late packages for service type A.
4.3 Table containing the statistical measurements for region DE1 with service type A.
4.4 Table containing the statistical measurements for region DEA with service type A.
4.5 Table containing the statistical measurements for region DE7 with service type DPDf Priority.
4.6 Result of statistical measurement for all service types combined.
4.7 Result of statistical measurement for service type A.
4.8 Result of statistical measurement for service type B.
4.9 The aggregated cross validation (CV) score for all service types, service type A, and service type B.
4.10 A table showing the 23 chosen ARIMA models.
4.11 Regression variables for the created BSTS models.
4.12 Table containing variable significance scores in fitted GAMs. *: P-value < 0.05, **: P-value < 0.01, ***: P-value < 0.001.
4.13 Connecting software engineering principles with the tools implemented to achieve them during model development by the thesis writers.
4.14 Connecting software engineering principles with tools and techniques currently used at Company X during data science programming.

1 Introduction

In a digitally changing world, the digitalization of techniques becomes essential for businesses to stay competitive (Saarikko et al., 2020). Industries have started to implement data science tools in their businesses to challenge the market and create innovation. Factors that are essential for an industry to successfully utilize the power of data science are the availability of large amounts of data and the ability to collect it in a sustainable manner (Sarker, 2021). By studying the data, companies can create forecasting models and understand patterns within their business (Breuker et al., 2016). One large industry that has seen a direct benefit of data science tools is the supply chain industry. In this industry, large amounts of data that can be gathered are generated continuously (Vandeput, 2021).

The term supply chain was first coined in 1982. It describes a network of organizations and companies that are involved in the various production or delivery processes that create value in the form of products or services to an end consumer (Stadtler and Kilger, 2011). Supply chain management involves the management of the relationships between the involved organizations or companies in a supply chain to achieve a more profitable outcome for all involved parties (Christopher, 2005). Activities in supply chain management include information sharing, inventory management, and network design. Digitalization and new techniques allow supply chain networks to work differently and to improve performance; despite this, only a relatively small number of companies have utilized the full potential of digital tools (Vandeput, 2021). Cam et al. (2019) state that 53% of the polled manufacturing executives report increased revenues and 61% report decreased costs as a direct consequence of the introduction of data science tools in their supply chain.

One application of data science tools is to study delivery delays of commercial outbound parcels. Issues such as small delivery delays at the early stages of the delivery can lead to larger delays at the end, partly due to the bullwhip effect (H. Lee et al., 2004). Therefore, using digital tools can become essential in the future of the industry to understand these effects. This thesis will be conducted in collaboration with Company X, a cloud-based supply chain company that operates as a delivery manager by connecting stores, warehouses, and shipping locations with carriers, delivery networks, and service providers.
The specific case study will involve parcels delivered for Company G's online stores, for whom Company X is the delivery manager. Company G is an international retailer in the clothing industry.

The issue of parcels being late was initially discussed with Company X as a classification problem, where the aim is to predict whether a parcel will be classified as delayed or on time based on its features, such as carrier, region, and the states in the event chain it has passed through. However, providing a classification algorithm does not give much value to Company G, since knowing that a package will be delayed after it has already been shipped leaves little room for action. Instead, it is of higher interest for Company X to forecast the temporal behavior of aggregated shipments to regions, which gives Company G the possibility to plan support for predicted days with an unusually high rate of delivery delays. Hence, time series forecasting methods are suitable for the issue at hand. By modeling and analyzing the ordered sequence of observations and predicting days when the business will report a higher rate of delivery delays than usual, Company G can plan for these days.

This thesis will focus on the techniques auto-regressive integrated moving average (ARIMA), Bayesian structural time series (BSTS), and generalized additive models (GAM) to perform time series analysis and to compare the techniques with each other, since these models have rarely been applied in the given context. The details of these models will be elaborated upon in the next chapter. The thesis aims to study whether tracking data can be used as regression variables to help forecast delivery delays using time series. BSTS and GAMs enable the use of regression variables in time series forecasting. BSTS has built-in functionality for choosing the most important regression variables using spike-and-slab priors (S. Scott and Varian, 2014), making it a convenient model for studying the impact of adding regression variables. In BSTS, the regression variables enter the model linearly. Instead of individual linear predictors, GAM allows smooth functions to be modeled, which can capture more complex patterns (Hastie and Tibshirani, 1986), such as irregular spikes in the time series, by modeling non-linear effects of the regression variables. ARIMA is one of the most widely used approaches in time-series forecasting but does not use any regression variables. It is therefore a suitable baseline model for comparison.

While the fast-approaching utilization of digital tools in supply chain industries opens many exciting areas, the fast-growing field simultaneously faces many challenges. One challenge is that data scientists often lack knowledge of engineering principles when developing data science models (Russell et al., 2019). To develop a sustainable framework for data science programming, data scientists should draw on other research areas, such as software engineering (Wilson et al., 2014; Russell et al., 2019). By applying software engineering methodologies and principles in the development process of data science models, the maintainability and reliability of the models can be improved (Wilson et al., 2014). Since software engineering practices involve systematic and efficient development processes to maintain and improve software robustness (Laranjeiro et al., 2021), these principles should be possible to implement in a data science context as well.
Theoretical software engineering principles are rarely applied in real software development settings (Al-Sarayreh et al., 2021), resulting in gaps between theory and industry. Therefore, with the help of Company X, the validity of various software engineering principles suggested by previous research is analyzed. This is achieved by applying the found principles during model development and by conducting interviews with data scientists at Company X. Thus, a comparison can be made between what industry (Company X) perceives as important and what principles were found to be important during our model development process and theoretical research. Developing the models allows us to mimic tasks that the data scientists at Company X work with.

1.1 Statement of the problem

It is suggested that new, modernized approaches using data science tools can improve predictions within the supply chain network (Vandeput, 2021). As briefly mentioned, the thesis will be conducted together with Company X. Although Company X is a cloud-based software company operating in the supply chain business, its data science efforts have only accelerated in recent years. Today, only a handful of data scientists work at Company X. To further develop data science efforts within the company, they wish to investigate the use of time series analysis and the models ARIMA, BSTS, and GAM to forecast delivery delays within their supply chain network. At the same time, the implementation of data science models requires better approaches in order to create reliable and maintainable software (Russell et al., 2019). Since the data science department at Company X is still relatively young, they wish to create a framework for the development process that accommodates software engineering principles.

1.2 Research question

To fulfill the purpose of this thesis, the following research questions will be answered:

1. How do BSTS and GAM models perform, in combination with tracking-event data, in the task of forecasting parcel shipment delays, in comparison with a traditional ARIMA model?
2. What software engineering principles exist for data science programming, and how can these be implemented in practice?
3. How are software engineering principles currently applied in data science programming at Company X?

Figure 1.1: Most common chain of events.

1.3 Project case

The specific case within Company X is to study shipments from Company G in Germany. The data consists of shipments from one fixed location to customer locations spread across Germany. Variation between different shipments manifests itself in varying geographic locations, types of carriers performing the shipment, types of services, delivery routes, and so on. Delivery of a package from an origin point to a final destination can be divided into a series of sequential events. Each registered event data point includes a timestamp. Examples of these events are departure from the initial location, arrival at intermediate hubs, and in-flight changes to the shipment. The chain of events that a shipment undergoes during its journey from onset to the final destination is also subject to variation. Occasionally, duplicates of events appear during the lifetime of a shipment. In other cases, data on shipment events might also be missing. The duration of time that passes between two sequential events also varies between shipments. This is due to factors such as varying distance between hubs, changing amounts of traffic during different hours of the day, and so on.
The success of a shipment is measured by taking the difference between the estimated time of arrival (ETA) and the actual time of arrival (ATA). The ETA is set as soon as a shipment is confirmed. A successful shipment has minimal difference between ETA and ATA. This thesis will only consider late arrivals as unsuccessful deliveries; shipments that are delivered earlier than the set ETA are still considered successful. Based on this difference between ETA and ATA, each shipment is assigned a delivered-on-time (DOT) label. DOT is a binary label: a shipment is either late or on time, and a shipment is said to be on time if DOT = 0. A shipment's DOT = 0 if and only if ATA − ETA < 24 h. The most common chain of events for the studied shipments can be found in Figure 1.1.

1.4 Technical environment

The technical workflow of the project is carried out on the Google Cloud Platform (GCP), which is also used by the data scientists at Company X. Queries are made in GCP's BigQuery environment to access available data stored in the cloud server. The data science team at Company X working with Company G creates models in Jupyter notebooks using both Python 3 and R kernels in GCP's AI workbench. In the notebook, data is accessed by loading saved tables from BigQuery. These tables are then converted into data frames that are subsequently used in the notebooks as input for model creation. All GCP tools available to the data scientists at Company X are also available in the technical environment in which this thesis is conducted. Extensions such as cloud source repositories, workflow automation, and resource utilization analysis tools are all available. The programming languages chosen for the project are Python 3 for data preprocessing, since this is the language mainly used at Company X for its data science projects, and R for model creation, since R has well-developed libraries for ARIMA, BSTS, and GAM time series models. The technical environment was chosen to mimic how data scientists currently work at Company X.

2 Theory

This chapter presents the essential theory needed to understand ARIMA, BSTS, and GAM models. Furthermore, software engineering practices in the context of data science are also studied in order to build a framework for software engineering principles in data science projects. This chapter sets the foundation for the results and discussion chapters. The reader is assumed to have some knowledge of statistical modeling and software engineering.

2.1 Delivery time estimation

Despite the potential benefit of data science in logistics and supply chain networks, data science tools have yet to reach their full potential in these fields. Existing supply chain organizations use traditional software algorithms to predict delivery times (Niranjan et al., 2021). The logistics and supply chain management industry is overwhelmed with uncertainties due to the growing generation of structured, semi-structured, and unstructured data (Akbari and Do, 2021). Therefore, data science tools can be powerful due to their computational power and their ability to adapt to large-scale data sets and reduce this uncertainty. Today, this data tends in many cases to be underutilized (Niranjan et al., 2021). Data science tools have the potential to improve overall logistics and supply chain performance by identifying the driving factors that determine the success of a supply chain network (Makkar et al., 2020).
As carriers deliver record-breaking e-commerce volumes and the risk of delayed shipments increases, more visibility is needed in parcel tracking to better understand where parcels are in the delivery network and to reduce errors in estimated delivery times (Roberson, 2021). The technology behind parcel tracking has been around for years and has been enhanced over time to include services such as parcel re-routing and scheduling of two-hour delivery windows (Roberson, 2021). Furthermore, as e-commerce businesses place a high emphasis on customer-centric services and products, shipment delays have become a critical issue to be tackled as soon as possible.

Akbari and Do (2021) list a few explanations for the limited adoption of data science tools in the supply chain, such as a skill deficit in data science competence, the cost of adopting data science tools, and a lack of high-quality data sets. Shipment delivery time estimation has been researched from the perspective of shipment delay classification by Keung et al. (2021), where various algorithms such as Naive Bayes, decision tree, ANN, and KNN classifiers were compared for shipment delay classification. In the context of liner shipping, Salleh et al. (2017) studied the prediction of the arrival punctuality of a vessel at a port of call. Although various models have been used for the prediction of shipment delays, research still needs to be done in the area of parcel shipping delay prediction using time series.

2.2 Time Series

Time series analysis is a useful tool for data science programming practices in contexts involving temporal measurements. The purpose of time series analysis is to extract meaningful statistics and other characteristics of temporal data, which are then used for pattern recognition, forecasting, and other statistical modeling. At the most basic level, time series analysis is concerned with fitting some function of time to time series data and extrapolating it into the future (Brockwell and Davis, 2009).

Definition 1 (Time series). A time series is a sequence of observations (x_t, t ∈ T) with respect to an index set T ⊆ R.

A time series model is a specification of the joint distributions (or possibly only the means and covariances) of the sequence of random variables (X_t, t ∈ T), of which the sequence of observations (x_t, t ∈ T) is assumed to be a realization (Brockwell and Davis, 2009). Therefore, time series modeling refers to the selection of a suitable probability model for the observed data, expressed as a joint distribution of the sequence of random variables (X_t, t ∈ T) (Brockwell and Davis, 2009).

Definition 2 (Discrete time series). A discrete time series (x_t, t ∈ T_0) is one in which the set T_0 is discrete (Brockwell and Davis, 2009).

A continuous time series is obtained when the data is continuously observed over some time interval, for example T = [0, 1]. In reality, the stochastic process can only be observed at finitely many times. Therefore, only discrete models will be considered, and for simplicity the data are grouped by calendar day, so that a discrete set of integers, T ⊆ Z+, is observed. For many economic and social time series, the salient characteristics are the trend, which expresses the long-run evolution of the series, and a seasonal component that repeats itself more or less regularly over the observed time (Harvey, 1990). These characteristics should be captured by a time series model for the series.
Such a model can be formulated in several ways. A starting point is to assume that the series can be decomposed in the following way (Harvey, 1990):

\[ \text{Observed series} = \text{Trend} + \text{Seasonal} + \text{Irregular} \qquad (2.1) \]

The irregular component represents nonsystematic movements in the series and is often modeled as Gaussian white noise. (Cumulatively summing such a noise process yields a random walk.) A time series may also contain other components and, in many cases, the properties of a series are subject to change over time. Therefore, it is beneficial that a model can adapt to changes in the properties of a series. A principal structural time series model is a time series model in which the parameters are allowed to vary over time and which can be formulated in terms of components that have a direct interpretation. The different components enter into the model in an additive fashion. The stochastic properties of a structural model depend on the full set of disturbances, and in time series modeling it is essential to characterize the properties of these disturbances (Brockwell and Davis, 2009).

2.2.1 Stationarity

Three conditions must be satisfied for a stochastic process to be stationary (Brockwell and Davis, 2009):

\[ E(y_t) = \mu \qquad (2.2) \]
\[ E[(y_t - \mu)^2] = \mathrm{Var}(y_t) = \gamma(0) \qquad (2.3) \]
\[ E[(y_t - \mu)(y_{t-\tau} - \mu)] = \gamma(\tau) \qquad (2.4) \]

where E(·) is the expectation, µ is the mean value, Var(·) is the variance, and γ(·) is the autocovariance function. The autocovariance is the covariance between an observation at an arbitrary time and a later occasion, where the total difference in time between the two observations is τ. What is essential about a stationary time series is that its properties do not change over time: the mean, the variance, and the autocovariance are all independent of the time t. Only the relative time difference τ between two observed time points is relevant in the analysis.

Stationarity can be achieved by differencing, which is the action of subtracting a previous observation y_{t-1} from the succeeding observation y_t, that is:

\[ \mathrm{Difference}(t) = y_t - y_{t-1} \qquad (2.5) \]

This is done for the whole observed time series. A time series may need to be differenced several times before it becomes stationary (Brockwell and Davis, 2009).
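As a small illustration (not code from the thesis; the series `late_share` is a hypothetical placeholder), the following R sketch estimates how many times a daily series needs to be differenced and then applies the differencing from Equation 2.5 using the forecast package:

```r
# Minimal sketch, assuming a placeholder daily series `late_share`.
library(forecast)

late_share <- ts(runif(90, min = 0.05, max = 0.30), frequency = 7)  # placeholder data

# Estimate how many differences are needed to reach stationarity
d <- ndiffs(late_share)

# Apply d-fold differencing as in Equation 2.5 (no-op if the series is already stationary)
stationary_series <- if (d > 0) diff(late_share, differences = d) else late_share
```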
2.3 Auto regressive integrated moving average models

A general class of time series models that are also capable of exhibiting non-stationary behavior are the ARIMA(p,d,q) processes. ARIMA is an abbreviation of Auto Regressive Integrated Moving Average (Harvey, 1990). An auto-regressive model creates forecasts of the target variable using a linear combination of past observations of the target (Hyndman and Koehler, 2006). "Auto" indicates that it is a regression of the variable upon itself. In order to understand the ARIMA(p,d,q) model in greater detail, an explanation of how its individual components behave will be given.

First, the auto-regressive part is explained. An auto-regressive model of order p, AR(p), is shown below:

\[ y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \dots + \phi_p y_{t-p} + \varepsilon_t \qquad (2.6) \]

Here, ε_t denotes an additive noise mechanism. As seen in Equation 2.6, y_t is forecasted with lagged values of y_t as predictors. Equation 2.6 is referred to as an AR(p) model, meaning an auto-regressive model of order p. The ϕ_1, …, ϕ_p are the model parameters and c is a constant. Changing ϕ_1, …, ϕ_p results in different time series patterns. The AR(1) model, which is restricted to stationary data, can be used to exemplify the auto-regressive model (Hyndman and Koehler, 2006):

- ϕ_1 = 0 gives y_t equivalent to white noise.
- ϕ_1 = 1, c = 0 gives y_t equivalent to a random walk. If c ≠ 0, y_t is equivalent to a random walk with drift.
- ϕ_1 < 0 gives a y_t which fluctuates around the mean.

The MA(q) part of ARIMA is the moving average model, which has the following mathematical notation (Hyndman and Koehler, 2006):

\[ y_t = c + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \dots + \theta_q \varepsilon_{t-q} \qquad (2.7) \]

The MA(q) model uses past forecast errors in a conceptual regression model instead of past values of itself. Here, c denotes the mean of the series, while θ_1, …, θ_q are the parameters of the model. The values ε_t, …, ε_{t−q} denote the additive noise mechanism for each temporal observation, and q is the order of the MA model. To create an auto regressive moving average (ARMA) model, the time series needs to be stationary (Hyndman and Koehler, 2006). The I(d) in ARIMA is the integrated part, which refers to the number of times the studied time series must be differenced in order to achieve stationarity. When combining differencing with an auto-regressive model and a moving average model, an ARIMA model is obtained that forecasts its target y_t with predictors that include both lagged errors and lagged values of y_t.

2.4 Bayesian structural time series models

BSTS models mainly consist of three distinct components (S. Scott and Varian, 2014). These components are:

- A "basic structural model". This is the technique used for decomposing the time series into different elements, such as trend and seasonality, and it is estimated using Kalman filters (S. Scott and Varian, 2014). Kalman filters are recursive algorithms that produce estimates of the state vector at time t based on the information available at time t (Harvey, 1990). The state-space equations will be elaborated upon in Subsection 2.4.1. The interested reader is referred to chapter three in Harvey (1990) for further information on Kalman filters.
- The spike-and-slab method. This is a Bayesian prior that is used to find the most important regressor (independent, predictor) variables in the model.
- Bayesian model averaging. Bayesian model averaging provides a way to account for model uncertainty and is simply a weighted mixture of the probabilities under each model.

The three components, the Kalman filter, spike-and-slab regression, and Bayesian model averaging, tend to work well together. The parameters that must be estimated in the model are the spike-and-slab variables, which determine which variables are included in the model, the regression coefficients for the regression part of the model, and finally the variances of the error terms in the different components of the structural time series (S. Scott and Varian, 2014). In the following subsections, a brief overview of each method is presented.

2.4.1 Structural time series

The practical usefulness of structural time series stems from the fact that they are flexible and modular (Brodersen et al., 2015). The flexibility is derived from the fact that a large class of models, including ARIMA models, can be written in state space form. Structural time series models are exactly state-space models for time series data. A structural time series model can be defined in terms of the following pair of equations:
\[ y_t = Z_t^{\mathsf{T}} \alpha_t + \epsilon_t \qquad (2.8) \]
\[ \alpha_{t+1} = T_t \alpha_t + R_t \eta_t \qquad (2.9) \]

where ε_t ∼ N(0, σ_t²) and η_t ∼ N(0, Q_t) are independent of the other unknown variables. Equations 2.8 and 2.9 are called the observation equation and the state equation. The latter governs the evolution of the latent d-dimensional state vector α_t over time; the former links the observed data y_t to the state vector α_t. Here, y_t is a scalar observation, Z_t is called the output vector and is d-dimensional, T_t is the state transition matrix with dimension d × d, and R_t is called a control matrix with dimension d × q. The terms ε_t and η_t are the error terms for the observation and the state, respectively. Structural time series are modular in the sense that it is possible to assemble the model matrices Z_t, T_t, R_t, and Q_t from a library of modular submodels (i.e., the BSTS library by S. Scott and Varian (2014)) to capture important features of the data such as trend and seasonality.

2.4.2 Spike-and-slab

The spike-and-slab method is a variable selection method that can be used for selecting the most "important" variables in a Bayesian setting (S. Scott and Varian, 2014). Let γ be a vector with the same length as the number of regression variables in the data. Let γ_k = 1 if β_k ≠ 0 and γ_k = 0 if β_k = 0. Denote the subset of elements of β where β_k ≠ 0 as β_γ. The spike-and-slab prior can be described as:

\[ p(\beta, \gamma, \sigma_\epsilon^2) = p(\beta_\gamma \mid \gamma, \sigma_\epsilon^2)\, p(\sigma_\epsilon^2 \mid \gamma)\, p(\gamma) \qquad (2.10) \]

Here, p(γ) is the "spike", since it places positive probability mass at zero. It is often convenient in practice to use an independent Bernoulli prior:

\[ \gamma \sim \prod_{k=1}^{K} \pi_k^{\gamma_k} (1 - \pi_k)^{1 - \gamma_k} \qquad (2.11) \]

where π can be interpreted in terms of the "expected model size". For simplicity, one can choose π_k = π for all k. Then, if one expects p predictors to be included, one can simply choose π = p/K, where K is the number of predictor variables in the model (S. Scott and Varian, 2014). It is also noteworthy that specific variables can be forced to be included by setting π_k = 1. The thesis will not elaborate further upon the mathematical underpinnings of the model, since the BSTS software package includes default values for the parameters and this lies outside the scope of the report. For further reading on the topic, the interested reader is referred to S. Scott and Varian, 2014.

2.4.3 Bayesian model averaging

In order to yield forecasts of y_{t+1}, draws are made from the posterior distribution of the parameters of the model. The drawn parameters are combined with the available data to produce a forecast for that specific draw. An estimate of the posterior distribution of the forecast y_{t+1} is obtained by repeating these draws many times. This way of producing forecasts is motivated by the statement of S. Scott and Varian (2014): "[. . . ] averaging over an ensemble of models does no worse than using the best single model in the ensemble." The draws and forecasts are generated automatically by applying the prediction method predict.bsts() supplied by the BSTS library by S. Scott and Varian (2014).
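To illustrate how these three components are combined in practice, the R sketch below fits a BSTS model with a local linear trend, a weekly seasonal component, and a spike-and-slab regression, and then forecasts ahead with predict.bsts(). This is an illustrative sketch rather than the thesis's actual code: the data frame `train`, the regressor names, and the chosen settings (number of MCMC iterations, expected model size) are assumptions made for the example.

```r
# Minimal sketch, assuming hypothetical data: `train` has a daily target `late_share`
# and tracking-event features `hub_dwell_time` and `n_events`.
library(bsts)

# State specification: local linear trend plus a weekly seasonal component
ss <- AddLocalLinearTrend(list(), train$late_share)
ss <- AddSeasonal(ss, train$late_share, nseasons = 7)

# Spike-and-slab regression on the tracking-event features;
# expected.model.size sets the prior expected number of included regressors (Section 2.4.2)
fit <- bsts(late_share ~ hub_dwell_time + n_events,
            state.specification = ss,
            data = train,
            niter = 1000,
            expected.model.size = 1)

# Posterior inclusion probabilities of the regressors chosen by the spike-and-slab prior
plot(fit, "coefficients")

# Forecast ahead; future values of the regressors must be supplied,
# and the returned mean is the Bayesian-model-averaged point forecast
pred <- predict(fit, newdata = future_regressors, burn = 100)
pred$mean
```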
2.5 Generalized additive models

To understand GAMs, it can be helpful to first understand generalized linear models (GLM). A GLM relates an expected value E(Y) to a linear predictor Xβ through a link function g(·), which provides the relationship between the linear predictor and the expected value. In the case of GAMs, the response variable instead depends on a set of smoothing functions s_k(·) (Hastie and Tibshirani, 1986):

\[ \sum_{k=1}^{K} s_k(X_i) \qquad (2.12) \]

In other words, a GAM produces its predictions by summing over the values generated by the set of smoothing functions s_k(·), where X_i is the set of predictor variables in the data. While traditional linear regression models assume a linear form for the covariate effects, GAMs are capable of modeling the dependence of the target variable in a non-parametric fashion (Hastie and Tibshirani, 1986). The mathematical notation for a GAM can thus be written as:

\[ g(E(Y)) = \alpha + s_1(x_1) + s_2(x_2) + \dots + s_n(x_n) \qquad (2.13) \]

where Y is the target variable and E(Y) denotes the expected value of Y. g(·) is the link function, which links the expected value of Y to the additive combination of the predictor variables x_1, x_2, …, x_n. The s_k(·) are a set of smoothing functions that are capable of capturing non-linear relationships between the target variable and the predictor variables. The predictor is an additive function, which means that it is the sum of the set of fitted smoothing functions, hence the name generalized additive models. The smoothing functions s_k(·) are left unspecified and are estimated with scatterplot smoothers (Hastie and Tibshirani, 1986). Some examples of scatterplot smoothers that Hastie and Tibshirani (1986) recount are the running mean, running median, running least-squares line, kernel estimate, and spline.
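As an illustration of Equation 2.13 in code, the sketch below fits a GAM in R with the mgcv package. The package choice, the data frame `train`, and the particular smooth terms are assumptions made for the example and are not taken from the thesis.

```r
# Minimal sketch, assuming the mgcv package and hypothetical tracking-event features.
# Each s() term corresponds to one smoothing function s_k(.) in Equation 2.13.
library(mgcv)

gam_fit <- gam(late_share ~ s(day_of_year) + s(hub_dwell_time) + s(n_events),
               data = train,
               family = gaussian())

summary(gam_fit)          # approximate significance of the smooth terms
plot(gam_fit, pages = 1)  # fitted smooth functions
```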
2.6 Application areas for ARIMA, BSTS, and GAM

ARIMA models were popularized in 1970 by George Box and Gwilym Jenkins. They are time series analysis models suitable for time series that can be assumed to have a reasonable amount of continuity between the past and present (Stellwagen and Tashman, 2013). ARIMA models have been applied in the price prediction of electricity (Alsaedi et al., 2019). Other econometric areas are also common application areas for ARIMA models, since the model gives the user an understanding of the mean, autocorrelations, and trends of the data. Abu Bakar and Rosbi (2017) also suggest econometric applications such as forecasting market price changes, national house price inflation, and investment strategies for timing purposes. The application of ARIMA models in supply chain networks has been limited. Wang et al. (2021) use ARIMA models for demand forecasting for Taiwan's semiconductor industry with the purpose of inventory management, which can be related to supply chain network forecasting.

BSTS has been utilized in the context of news sentiment influence on stock prices (Ray et al., 2021) and for fuel sales forecasting (Lian et al., 2021), and Brodersen et al. (2015) used a BSTS model for inferring the causal impact of advertising efforts at Google. Güvercin et al. (2021) applied various time series models for predicting, not shipment delivery times, but flight delays, using clustered network models based on airports, where the models are applied individually to each airport's data. Others have also studied flight delays (Yiu et al., 2021). The use of BSTS models in the context of predicting shipment delivery delays is limited. One explanation for this is that carrier data is not universally available, in contrast to flight data.

GAMs are widely popular and have been applied in several different research areas. In research areas relevant to this study, 206 research articles have been published (Engineering (100), Decision Science (73), Economics, Econometrics, and Finance (73)). Some examples of the applications of GAMs in time series analysis are the prediction of food and beverage sales in staff canteens and restaurants (Posch et al., 2022) and the association between meteorological factors and daily new cases of COVID-19 (Yuan et al., 2021).

2.7 Software engineering principles

Al-Sarayreh et al. (2021) conduct a systematic mapping and a quantitative literature review of general software engineering principles and find 30 relevant papers on this subject, ranging from the publication year 1969 to 2020. A total of 592 software engineering principles were proposed. Naturally, many of them express similar meanings, but some express contradicting meanings. The authors state that, after the literature review of the 30 found papers, more work needs to be done on integrating principles with practices, which in this thesis is expressed as suggesting tools and techniques for applying said principles. The authors also suggest that further work needs to be done on making software engineering principles more available across application domains and other disciplines, such as data science programming. Finally, Al-Sarayreh et al. (2021) claim that more work needs to be conducted on applying software engineering principles in real software development environments.

2.7.1 Strategic engineering areas for data science programming

In practice, a data science model is only a minor part of the overall area of data science programming. Despite modeling only being a smaller part of the whole area, it is still essential that a model works properly so that it can be a reliable component of a larger software system. The larger software system also covers tasks such as the engineering of data pipelines, monitoring, and logging (Bosch et al., 2021). In a typical prediction model, integrated into a larger software system, there are four major data science activities (Bosch et al., 2021): assembling of data sets, creating and evolving the machine learning model (model development), training and evaluating the model, and finally deployment. There are engineering challenges that need to be fulfilled for the model to perform reliably in the system (Bosch et al., 2021). The first challenge is data quality management. Engineering principles must be set to retrieve data of sufficient quality for training and inference, and questions such as the handling of missing and unbalanced data, and other preparations, must be addressed. The second strategic area concerns design methods and processes. To create an understandable and easy-to-use model, it must be repeatable and of high engineering quality. Thirdly, model performance is a challenging area in the model development process. The performance of the model is measured both in accuracy and in quality attributes; therefore, both problems in training data and lack of support for quality attributes must be addressed. Deployment and compliance of the model is the final strategic area that must be addressed. Ultimately, deployment of a model in an operational system might require changes in the overall architecture of the system.

2.7.2 The need for software engineering principles in data science programming

When developing software products, many data scientists lack skills in formal and rigorous software engineering principles (Russell et al., 2019). In some cases this results in software with poor maintainability and reliability. Because of this, extra time needs to be spent on tasks such as debugging.
To increase the maintainability and reliability of a model, many software engineering methodologies can be used (Wilson et al., 2014). One major reason why data scientists rarely follow software engineering principles when programming is that there is little exchange of ideas between the general software engineering community and the data science community, according to Hannay et al. (2009). The reason for this is that data science developers and "regular" software developers often work with tasks of a different nature. Although both professions involve programming, the end results of the programming tasks differ. This leads to surprisingly little collaboration between software engineers and data scientists, since the collaboration opportunities are lacking.

Hannay et al. (2009) also claim that there is a lack of formal training for data scientists in the field of software engineering. The software development aspects of their work are often self-taught. The formal training received is also often given by computer science departments, which teach general software courses that data scientists find less relevant for their domain-specific knowledge. Another aspect is that data scientists might not see the need for formal software development training. This stems from the fact that data science programming is often first done on a small scale in an exploratory manner, and it grows larger in scale with time if the software proves its usefulness in scientific investigations. Before the software proves its usefulness, it is uncertain whether it will be part of a larger end product. Therefore, the need for software engineering practices in data science programming often becomes visible only when it is too late, according to the authors.

2.7.3 Software engineering principles in data science programming

Wilson et al. (2014) present a set of eight software engineering practices that should be adopted when developing data science models. This thesis builds upon the software engineering principles suggested by Wilson et al. (2014), but also complements them with domain-relevant techniques and tools that help the data scientists at Company X adopt the given principles.

Consistent coding implies writing code that is both understandable and easily read by other programmers. This increases productivity when coding, according to Wilson et al. (2014). A parallel that the authors draw is breaking up code in the same way that scientific papers are broken down into sections and paragraphs to make the content understandable. Furthermore, the code needs consistency in its styling format. Suggested techniques for consistent coding are breaking up code into easily understood functions, each of which performs a single task. Variables and functions should be named consistently and distinctively from each other, and the name should explain what the function does or what value the variable holds. Deissenbock and Pizka (2005) suggest that a useful tool for consistent naming is an identifier dictionary. By storing information about all identifiers, such as their names, the type of the object, and a comprehensive description, the consistency of the code can be improved. Another suggestion is to use coding style frameworks to keep a consistent coding style. An example of a coding style framework is the PEP8 style guide for Python 3 code (van Rossum et al., 2001).
The guide provides rules for writing Python 3 code, such as consistent variable declaration, code length, and documentation approaches. The PEP8 style guide can be enforced with the Pylint library, which catches bugs and style problems in the Python 3 code written.

Automating task performances allows the program to unburden the programmer of some tasks. When working with data science models, it often happens that a task is performed repeatedly. To save time, these repeatable tasks could be automated instead of performed manually. Wilson et al. (2014) suggest that recent commands should be saved in a file for re-use. One simple approach is to use command-line tools that have a history option, which allows users to re-execute historic commands. Commands can also be saved in a notebook by creating a cell that re-runs a set number of functions and classes. Another technique for automating task performances is to run time-consuming computer programs at optimal times when the programmer is inactive. GCP allows the user to schedule a notebook's runs by using the Vertex AI workbench (Namjoshi, 2021). By scheduling training of models at optimal times and re-training when the data has been updated, the efficiency of a data scientist's work can increase. Automation of programs can also be done through GCP airflow tools, where scheduled running sequences can be set to run depending on the results produced by parts of the program (GoogleCloud, 2022a). Utilizing this tool decreases the manual labour required of a data scientist.

Incremental coding means working on software step-wise. A major difference between commercial software development teams and data science programming teams is that data science programming teams tend not to get static requirements from stakeholders (Wilson et al., 2014). The requirements they receive are seldom technical but rather concern exploration of the data. In addition, data scientists often cannot know what to do next before the current version of a model has produced some results. Therefore, it is of great advantage to work in incremental steps with frequent feedback from the stakeholders during data science programming. When working incrementally, a challenge that can arise is the tracking of changes to the software, especially when working in collaboration with other programmers (Wilson et al., 2014). With several people changing the same code and with requirements being agile, programmers need to set up a version control system to keep track of the software. The purpose of a version control system is to store historical changes in a repository. By doing so, projects can be worked on individually and locally, and the final program is only modified when a programmer commits their changes. Furthermore, version control systems also allow users to resolve conflicts when several people have edited files simultaneously and to review each other's code. GCP allows the user to use cloud source repositories, which is a tool for version control (GoogleCloud, 2022b). In the cloud source repositories, the users can perform git operations and keep track of the history of the developed model. GCP also allows version control of the data; by using version control on the data, the user can recreate previously produced results of the model. Furthermore, Wilson et al. (2014) suggest that it is preferable to follow an agile software development methodology over a traditional "waterfall" software development methodology.
Avoiding repetition is essential for software modularity. If changes and updates are made to code that is repeated in multiple places, the risk of errors and inconsistencies increases (Wilson et al., 2014). The main technique suggested for this principle is writing modular code, according to Wilson et al. (2014). In this way, duplication of code is minimized and bug fixes are implemented faster. With modularized code, the user can also re-use code rather than rewriting it. Modular code can also be achieved by writing modules and libraries that Jupyter notebooks depend upon, rather than writing all of the code in notebook cells. Creating functions can also help the programmer avoid rewriting code.

By testing notebooks, data scientists are more likely to produce reliable and maintainable code. Wilson et al. (2014) state that verification of the code and maintaining the validity of the code over time are essential in data science projects. By writing assertions and unit tests, the validity of the code can be maintained. Data scientists lack the habit of writing tests the way a traditional software engineer does, partly because of the lack of coding standards but also because the environment in which programming is done differs (Russell et al., 2019). One package that can be utilized in Jupyter notebook environments to write tests is testbook, which allows unit tests to be run on notebooks from separate test files, thereby treating Jupyter notebook files as Python files (nteract team, 2021). The unittest module can also be used directly within a Jupyter notebook to create unit tests. Russell et al. (2019) claim that software testing is the software engineering principle within data science model development that has progressed the least in recent years. The authors show that, of all available Comprehensive R Archive Network packages, around 70% do not have tests made available with the package. The authors highlight the importance of writing unit tests and acceptance tests during the development of R packages. Furthermore, the authors also state that unit tests should be written by the developers of the software.

Software optimization allows the software to utilize its resources more efficiently in order to become more time-efficient. Computational programs can often be optimized by rewriting them in other, lower-level languages. Wilson et al. (2014) suggest that if data scientists wish to optimize their code execution with lower-level languages such as C, it should be done only after the code works correctly in a higher-level language. This is due to the fact that higher-level languages help the user with program comprehension.

Software documentation can be done in several ways and is essential for a program's understandability. To highlight the importance of documentation in scientific programming, Wilson et al. (2014) draw a parallel to experimental protocols for research methods. In both cases, the purpose of the documentation procedure is to make the artifact reproducible and understandable. The maintenance cost is also lowered with documentation. B. Lee (2018) discusses the importance of writing comments as the programmer codes. Error messages should be informative and provide a solution to the error if it occurs. If the developed program operates in a command-line interface, it is beneficial to include a help command for the functions that are created by the programmer.
Software optimization allows a program to use its resources more efficiently and thereby become more time-efficient. Computational programs can often be optimized by using other, lower-level, languages. Wilson et al. (2014) suggest that if data scientists wish to optimize their code execution with lower-level languages such as C, it should be done only after the code works correctly in a higher-level language, because higher-level languages help the user with program comprehension.

Software documentation can be done in several ways and is essential for a program's understandability. To highlight the importance of documentation in scientific programming, Wilson et al. (2014) draw a parallel to experimental protocols in research methods: in both cases, the purpose of the documentation procedure is to make the artifact reproducible and understandable. Documentation also lowers the maintenance cost. B. Lee (2018) discusses the importance of writing comments as the programmer codes. Error messages should be informative and provide a solution to the error if it occurs. If the developed program operates through a command-line interface, it is beneficial to include a help command for the functions created by the programmer; B. Lee (2018) also suggests using the package Click to create command-line interfaces. Comments explaining functions and code snippets should also be written when necessary. Beyond documentation within the program itself, external documentation should be provided according to B. Lee (2018): a README file should be included with basic information and a quick-start guide for the developed model. In contrast to Russell, Bennett, and Ghosh (2019), B. Lee (2018) states that the most underemphasized software engineering principle in data science programming has been software documentation, and highlights that software reproducibility suffers from the lack of it.

Software collaboration is the final area of software engineering principles that should be adopted in the domain of data science programming (Wilson et al., 2014). Collaboration during data science programming can eliminate bugs and improve the code's readability. It is also a good way to spread knowledge and best practices around the team. Techniques such as code reviews before merging and pair programming are brought up as suggestions for implementing collaboration when developing data science software. Kalyan et al. (2016) state that utilizing Git repositories for code review can improve a team's software collaboration efforts by allowing several team members to look at pull requests before code is merged. By implementing standards for code review, developers can improve collaboration and ensure that other software engineering principles are followed in a synchronized way within the team (Panichella and Zaugg, 2020).

3 Methods

The methods chapter describes the methods followed during the thesis in order to answer the presented research questions.

3.1 Research procedure

Stol and Fitzgerald (2018) highlight eight different research strategies in the domain of software engineering, which are shown in Figure 3.1. The research conducted during this thesis is classified as what Stol and Fitzgerald (2018) describe as a field study: "[...] A field study refers to any research that is conducted in a specific, real-world setting to study a specific software engineering phenomenon." The specific software engineering phenomenon studied in this thesis is the applicability of ARIMA, BSTS, and GAM in forecasting parcel delays, and the applicability of software engineering principles in data science programming. The study is unobtrusive, meaning that the researchers will not change or control any of the parameters or variables of how Company X currently operates. It is also context specific, since the thesis only studies the applicability of the models and principles within Company X.

Since this thesis evaluates the models ARIMA, BSTS, and GAM in time series forecasting and also evaluates how software engineering principles can be applied in the context of data science programming, the thesis can be categorized into two different research areas: data science and software engineering. The data science area involves model development and evaluation, while the software engineering area involves applying and evaluating software engineering principles in a data science context. The field study is conducted by following two parallel processes, which are illustrated in Figure 3.2. To implement the field study, a literature review is first conducted, where the current state of research in both of the relevant areas is examined.
These areas are the ARIMA, BSTS, and GAM models and the application of software engineering principles in the domain of data science programming. The results of the literature review are outlined in the theory chapter. In the second step of the field study, the models and the applicable software engineering principles are implemented in a real-world setting provided by Company X. Lastly, the performance of the models is evaluated. The utility of the identified software engineering principles is evaluated by examining differences between practices at Company X and theory, as well as by examining the reasoning behind why some principles are followed and why others are not.

Figure 3.1: The conducted research and its methodology steps in relation to the eight research strategies for software engineering by Stol and Fitzgerald (2018).

Figure 3.2: The two parallel processes of data science and software engineering conducted during this thesis.

3.2 Model development methods

In this section, the key concepts that the devised models are built upon are presented. Before a methodology could be established, an initial data exploration phase was carried out to understand the data at hand. The target variable to be predicted is the share of late packages to a specific NUTS1 region for a specific service type. NUTS stands for the Nomenclature of Territorial Units for Statistics and is a geocode standard for hierarchically dividing a country into subdivisions for statistical purposes (Commission, 2021). The total number of NUTS1 regions in Germany is 16, and there are two service types for each region, service type A and service type B. Therefore, there are 32 (16 regions times two service types) potential time series to be studied. The share of late packages is defined as the total number of late packages divided by the total number of packages shipped during a day. A package is classified as late if its ATA is at least 24 hours greater than its ETA. The data on which the study is built corresponds to shipment information for the calendar year 2021 in Germany. In this thesis, three models that can forecast the target variable are implemented. The share of late packages is forecasted individually for each combination of NUTS1 region and service type. The NUTS regions are displayed in Figure 3.3.

Figure 3.3: NUTS1 regions of Germany.

3.2.1 ARIMA

Firstly, an ARIMA model is fitted to the various time series: an individual ARIMA model is fit to each service type in each region's time series of observed values of the target variable. As stated previously, the observed values in the time series are the share of late packages for each day. This ARIMA model is used as a baseline model, and the subsequent models are compared against the performance of the baseline model and against each other. An optimal ARIMA model for each distinct regional time series of late shares is found by applying the R function auto.arima developed by Hyndman and Khandakar (2008). The function searches over possible models, within the order constraints provided by the user, and returns the best model according to either the AIC, AICc, or BIC value; these abbreviations denote three different estimates of prediction error. The default criterion of auto.arima is the AIC, which is used in this study as well. Furthermore, the order constraints provided to the auto.arima function are the function's default values. These are the following:

\[ \max(p, d, q) \times (P, D, Q, s) = (5, 2, 5) \times (2, 1, 2, 1), \tag{3.1} \]

where the first three parameters (5, 2, 5) specify the maximum orders of the parameters of the ARIMA function, and the four last parameters (2, 1, 2, 1) specify the maximum order of a potential seasonal component of the model. In other words, the auto.arima function searches over a space of models that includes seasonal ARIMA models. An ARIMA model with a seasonal component is also commonly known as a SARIMA model, where the S stands for seasonal, hence the name.
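The order search itself is done in R with auto.arima; purely as an illustration of the same stepwise search, the sketch below uses the Python library pmdarima with the same maximum orders on a synthetic series. The series, and the assumed seasonal period of six weekdays, are placeholders rather than the thesis setup.

import numpy as np
import pmdarima as pm

# Synthetic stand-in for one region's daily share of late packages.
rng = np.random.default_rng(0)
y = np.clip(rng.normal(loc=0.1, scale=0.03, size=260), 0.0, 1.0)

# Stepwise search over ARIMA orders up to (p, d, q) = (5, 2, 5) and seasonal
# orders up to (P, D, Q) = (2, 1, 2), selecting the best model by AIC.
model = pm.auto_arima(
    y,
    max_p=5, max_d=2, max_q=5,
    seasonal=True, m=6,  # assumed six-day weekly cycle once Saturdays are removed
    max_P=2, max_D=1, max_Q=2,
    information_criterion="aic",
)
forecast = model.predict(n_periods=10)  # ten-day out-of-sample forecast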
3.2.2 BSTS

A BSTS model is fitted for every service type in every region. The R package bsts by S. Scott and Varian (2014) is used for the implementation. The BSTS model developed during this thesis contains four distinct state-space components.

The first component is an automatic auto-regressive component that adds a sparse AR(p) process to the state distribution. The component contributes α_t to the expected value of Y_t, where α_t is defined as

\[ \alpha_t = \phi_1 \alpha_{t-1} + \dots + \phi_p \alpha_{t-p} + \epsilon_{t-1}, \qquad \epsilon_t \sim \mathcal{N}(0, \sigma^2), \tag{3.2} \]

where the component consists of the last p lags of α (S. L. Scott, 2021). The Auto-AR(p) component considers up to p lags for the time series models and decides on the optimal number. The ϕ_i, i ∈ {1, 2, ..., p}, are the coefficients for the p lags of the chosen AR(p) model. N(0, σ²) denotes a normally distributed variable with mean zero and variance σ²; in other words, ϵ_t is a white noise process. The bsts package chooses the optimal Auto-AR(p) values.

The second component of the BSTS model is a local linear trend component. This component adds a local linear trend to the model, which assumes that both the mean and the trend of the time series follow random walks. The equations for the mean (3.3) and the slope (3.4) are the following (S. L. Scott, 2021):

\[ \mu_{t+1} = \mu_t + \delta_t + \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, \sigma^2), \tag{3.3} \]

\[ \delta_{t+1} = \delta_t + \eta_t, \qquad \eta_t \sim \mathcal{N}(0, \sigma^2). \tag{3.4} \]

As one can observe, the mean value at an arbitrary time is the sum of the mean value and the slope at the previous time of measurement plus an error term. The slope at time t + 1 is simply the slope at time t plus a Gaussian white noise term.

The third component of the BSTS model is a seasonal component. Since Saturdays are removed from all the regional time series (this is elaborated upon in Subsection 3.3.2), the seasonal component has a periodicity of six. Denote the number of seasons in the dataset as S. The model can then be interpreted as a regression on S dummy variables whose coefficients sum to zero in expectation. In the case of six seasons, the state vector γ_t has dimension S − 1 = 5. The first element obeys the following equation (S. L. Scott, 2021):

\[ \gamma_{t+1,1} = -\sum_{i=2}^{S} \gamma_{t,i} + \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, \sigma^2). \tag{3.5} \]

The last component of the BSTS model is a dynamic regression component, in which the regression coefficients change over time according to a random walk. This dynamic regression component is fit to the features that are engineered in this report. To yield forecasts with the BSTS model, Bayesian model averaging is performed in the way described in Subsection 2.4.3: 2000 draws are made from the posterior distribution, and the forecasted value is precisely the average value of all the draws.
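A small sketch of this averaging step is given below; the array of posterior draws is simulated, since the point is only to show that the point forecast is the mean over the 2000 draws (the shapes and numbers are illustrative).

import numpy as np

# Stand-in for 2000 posterior predictive draws over a ten-day forecast horizon,
# as would be produced by a fitted BSTS model.
rng = np.random.default_rng(42)
draws = rng.normal(loc=0.2, scale=0.05, size=(2000, 10))

# Bayesian model averaging: the forecast for each day is the mean of the draws.
point_forecast = draws.mean(axis=0)

# The same draws also give uncertainty bands, e.g. a central 95% interval.
lower, upper = np.quantile(draws, [0.025, 0.975], axis=0)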
3.2.3 GAM

The GAMs are fit using the mgcv R package created by Wood (2011). To capture a seasonal component, a new variable for weekdays is created, following Posch et al. (2022). This variable will be referred to as wDay in chapter 4. The weekday variable is a categorical variable taking a value from one to six (since Saturdays are removed), indicating the weekday of the data point. A smooth function is then fitted to the weekday variable, with the spline basis specified as a cyclic cubic spline. The cyclic cubic spline has a constraint stating that there should be no discontinuity between the endpoints of the spline (Simpson, 2022); it is therefore suitable for fitting a seasonal component. Similarly, fitting a trend function to the data requires the creation of a new variable. The trend component requires that the time t is given in numeric values, and therefore a variable called time is created, similarly to Posch et al. (2022), that maps all the dates in the data set to integers. A smooth function is then fit to the time variable to replicate a trend function; this is simply a function of time that is included in the sum of functions that the GAM uses to produce predictions. A smooth function is also created for each of the regression variables, where the basis used is a Gaussian process.

3.3 Feature engineering

The data that is used as the basis for the regression components of the models is the available shipment-specific data and event tracking data from Company X, which is based on parcels delivered for Company G. The information present in the shipment information data set is summarized in Table 3.1; the information in the event-specific data can be found in Table 3.2.

Table 3.1: Shipment information table.
    ShipmentID (INTEGER): a unique identifier for a shipment.
    ShipDateLocal (STRING): shipping date in local time.
    ReceiverCity (STRING): shipping destination city.
    ReceiverCountry (STRING): shipping destination country.
    ReceiverZipCode (STRING): shipping destination zip code.
    CarrierName (STRING): name of the carrier performing the shipment.
    ETA (TIMESTAMP): estimated time of arrival.
    ATA (TIMESTAMP): actual time of arrival at the destination.
    NUTS3 (STRING): NUTS3 region code of the shipping destination.
    DOT (INTEGER): classifies a shipment as on time (0) or late (1).

Table 3.2: Event tracking information table columns.
    TrackingEventGlobalID (STRING): a unique identifier for a tracking event.
    ShipmentID (INTEGER): a unique identifier for a shipment.
    Description (STRING): description of the event.
    Code (STRING): event description code.
    EventTimeUTC (TIMESTAMP): registration time of the event.
    EventTimeLocal (STRING): registration time of the event in local time.

The tables are combined on the key variable ShipmentID to link a specific shipment with its corresponding events. New features are then created for the times between events by subtracting the event time of a preceding event from that of its subsequent event. The thesis assumes that every shipment has the chain of events shown in Figure 1.1. The complete list of features engineered from the raw data at hand is summarized in Table 3.3 together with their corresponding descriptions.
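As a rough sketch of how such features could be derived with pandas, the function below joins the two tables and computes one volume feature and one time-between-events feature; the column names follow Tables 3.1 and 3.2, but the data frames and aggregation details are our illustration rather than the thesis pipeline.

import pandas as pd


def daily_features(shipments: pd.DataFrame, events: pd.DataFrame) -> pd.DataFrame:
    """Compute illustrative daily volume and time-between-events features."""
    merged = events.merge(shipments, on="ShipmentID", how="inner")
    merged["EventTimeUTC"] = pd.to_datetime(merged["EventTimeUTC"])

    # One column per event code, holding the first registration time per shipment.
    times = (
        merged.groupby(["ShipmentID", "Code"])["EventTimeUTC"]
        .min()
        .unstack("Code")
    )

    # Example time-between-events feature (TBE1 in Table 3.3), in hours.
    times["TBE1"] = (
        times["ArrivalFirstHub"] - times["DepartedFromG"]
    ).dt.total_seconds() / 3600.0

    # Aggregate per day: volume of DepartedFromG events and mean TBE1.
    day = times["DepartedFromG"].dt.date
    return pd.DataFrame({
        "DepartedFromG": times["DepartedFromG"].groupby(day).count(),
        "TBE1": times["TBE1"].groupby(day).mean(),
    })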
The features that are engineered are based on the following assumptions:

1. The variation in the number of shipments, both finished and unfinished, that are currently burdening the transport network has predictive ability towards the share of late packages in the following time period.

2. The daily variations in the average times between shipment events have predictive ability towards the share of late packages in the following time period.

3. The resulting observations of all variables are assumed to be realizations of normally distributed variables.

The viability of the first two assumptions is examined by comparing the predictive ability of the BSTS model and the baseline ARIMA model. By including a spike-and-slab prior in the BSTS model, the variables that are significant in a predictive sense are found by examining which variables have the highest inclusion probability in the resulting BSTS models. The viability of the last assumption is discussed in chapter 5.

Reading Table 3.3 top-down, the first five variables represent shipment volumes in different phases of a shipment during the day preceding the forecast. The last four variables represent the average time between shipment events (TBE) that a typical shipment registers during its journey from departure to end location.

Table 3.3: Regression variables with corresponding descriptions.
    DepartedFromG: the volume of packages that registered the tracking event DepartedFromG.
    ArrivalFirstHub: the volume of packages that registered the tracking event ArrivalFirstHub.
    ArrivalDestinationHub: the volume of packages that registered the tracking event ArrivalDestinationHub.
    OutForDelivery: the volume of packages that registered the tracking event OutForDelivery.
    Delivered: the volume of packages that registered the tracking event Delivered.
    TBE1: the mean time duration between the tracking events DepartedFromG and ArrivalFirstHub.
    TBE2: the mean time duration between the tracking events ArrivalFirstHub and ArrivalDestinationHub.
    TBE3: the mean time duration between the tracking events ArrivalDestinationHub and OutForDelivery.
    TBE4: the mean time duration between the tracking events OutForDelivery and Delivered.

After the creation of the predictor variables, the variables are scaled by transforming them into observations from a standard normal distribution. This assumption is supported by the central limit theorem, which states that for a sum of independent and identically distributed variables, each having a finite mean µ and a finite non-zero variance σ², the distribution of the standardized sample mean tends to the standard normal distribution (Montgomery, 2019). By grouping shipments by day, the features become daily standardized sample means after applying a standardization in which the mean and standard error are estimated from the whole population. This transformation rests on the assumption that the times between events for shipments are independent and identically distributed.

3.3.1 Missing time series data

Not all regions have a complete time series of observations of the target variable for every day in the sequence. Therefore, before any fitting is done to the time series, the missing data has to be accounted for. The process of filling in missing data in a time series is called interpolation (Klosterman, 2019). In the case of time series, temporal methods may be used; the general idea of these methods is that a missing data point presumably has a value in between its adjacent data points. A large set of interpolation methods is available in both NumPy and Pandas, which are also used in this thesis to fill in missing values in the different regional time series, due to the simplicity and convenience of this approach.
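A minimal sketch of this interpolation step with pandas is shown below; the short series is made up, and the same idea is applied to each regional time series.

import numpy as np
import pandas as pd

# Illustrative daily series of late shares with two missing observations.
dates = pd.date_range("2021-03-01", periods=7, freq="D")
late_share = pd.Series([0.08, np.nan, 0.11, 0.09, np.nan, 0.10, 0.07], index=dates)

# Temporal interpolation: each missing value is filled with a value lying
# between its adjacent observations.
filled = late_share.interpolate(method="time")
print(filled)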
It is noteworthy that more sophisticated approaches to handling missing data exist, where the problem of missing data is approached as a "problem within the problem", that is, the missing data is viewed as a predictive modeling task of its own. This is referred to as model-based imputation by Klosterman (2019).

3.3.2 Missing data statistics

Before missing values in the observed time series are filled in with interpolation methods, some initial pre-processing is done. After inspection of the different regional time series at NUTS1 resolution, one can observe that, in this particular case, the majority of the missing time series data concerns Saturday deliveries. One explanation is that the studied company does not perform deliveries on Saturdays. Another explanation is that deliveries performed on Saturdays are only registered on other days of the week, for example on Sundays or Fridays. The share of missing data in the various observed time series is presented in Table 3.4.

Table 3.4: Share of missing time-series data.
    Share of Saturdays in regional time-series data: mean 0.87 %, standard deviation 0.22 %
    Share of Saturdays in missing regional time-series data: mean 92.56 %, standard deviation 2.28 %
    Total share of missing data when including Saturdays: mean 14.67 %, standard deviation 0.46 %
    Total share of missing data when excluding Saturdays: mean 1.29 %, standard deviation 0.43 %

As one can observe, a large share of the missing time series data occurs on dates that are Saturdays. Therefore, before any interpolation methods are utilized to handle the missing values, Saturdays are removed from the observed time series. The practical usefulness of this procedure is that, instead of obtaining ∼15% artificially interpolated data in the regional time series after filling in missing values, ∼0.87% of the data is removed from the time series, and in the resulting regional time series only ∼1.3% of the data has to be filled in using interpolation methods. In other words, if Saturdays are present in the resulting time series of observations that is to be modeled, these values need to be filled in; if instead Saturdays are removed and the time series only consists of the remaining six weekdays, fewer values need to be filled in. The argument for implementing this data cleaning step is that the resulting time series models will make better predictions but also be more grounded in reality, due to the hypothesis that more artificial data would dilute the information embedded in the actual data.
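The pre-processing described above could be sketched with pandas as follows; the reindexing against a full calendar and the Saturday filter are our illustration of the procedure, not the exact thesis code.

import pandas as pd


def drop_saturdays_and_report(series: pd.Series) -> pd.Series:
    """Remove Saturdays from a daily series and report missing-data shares."""
    # Reindex against the full calendar so that gaps show up as NaN.
    full_index = pd.date_range(series.index.min(), series.index.max(), freq="D")
    full = series.reindex(full_index)

    missing = full.isna()
    saturdays = full.index.dayofweek == 5  # Monday = 0, Saturday = 5
    print("share of missing data incl. Saturdays:", round(missing.mean(), 4))
    if missing.any():
        print("share of missing days that are Saturdays:",
              round(missing[saturdays].sum() / missing.sum(), 4))

    # Keep the six remaining weekdays; far fewer values then need interpolation.
    return full[~saturdays]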
3.4 Model evaluation methods

Evaluation of data science models can be carried out in several ways. One way is to implement out-of-sample prediction, that is, to forecast values for future days using the trained models to predict beyond the training sample, and to compare the results of the forecasts with actual values from a test set. The performance of different models is compared by first calculating some statistical measure and then comparing the magnitude of said measure. Hyndman and Koehler (2006) discuss several statistical measures for evaluating the accuracy of time series forecasts. These are: root mean squared error (RMSE), mean absolute error (MAE), mean absolute percentage error (MAPE), and mean absolute scaled error (MASE). MAPE is useful when the scales or units of measure of the time series vary. Since the measure uses the size of the target variable to scale the result, the target variable must be non-zero in every case. Since this is not true in our case, the measure is omitted, as it would result in undefined values. This leaves three statistical measures of accuracy: the RMSE, the MAE, and the MASE.

The quality of the forecasts is evaluated in two different ways. The first is to split the data into one training set and one test set for every region. The three different models are then fitted to the training set, ten future forecasts are generated, and these are compared with the actual values (in the test set) using the chosen statistical measures.

Figure 3.4: Traditional train and test split.

The second way of evaluation is the use of time series cross-validation. The time series cross-validation methodology is summarized in Figure 3.5. The prediction accuracy is evaluated by a procedure sometimes called "evaluation on a rolling forecast origin" (Hyndman, 2022). The method works by averaging over forecasts, each with a horizon of size one: one starts by fitting the model to a part of the data and producing a forecast, then the forecast origin "rolls forward" in time by one step and the training set grows by one observation. This procedure is repeated ten times and, finally, the prediction accuracy is averaged over all of the fitted models' forecasts.

Figure 3.5: Train and test split for cross-validation.

Since the size of the forecasting horizon is one, the RMSE and MAE will give the same result, because the RMSE of a sample of size one is exactly the MAE (eq. 3.6 and eq. 3.7). Since the third statistical measure, the MASE, is a relative measure, it cannot be used in this case (Hyndman and Khandakar, 2008). Therefore, the statistical measure used for evaluating the cross-validation folds is RMSE/MAE.

3.5 Statistical Measures

The definitions of the three statistical measures used are elaborated upon in this section.

3.5.1 Root mean squared error

Historically, the RMSE has been very popular, largely due to its relevance in statistical modeling. However, it has been criticized for its low reliability (Hyndman and Koehler, 2006; Armstrong et al., 1992). Despite the criticism, the RMSE is included as one of the statistical measures here due to its widespread adoption and understandability. The RMSE is defined as

\[ \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{j=1}^{n}\left(y_j - \hat{y}_j\right)^2}, \tag{3.6} \]

where ŷ_j is the j-th predicted value, y_j is the j-th actual value, and n is the number of observations.

3.5.2 Mean absolute error

The MAE is a statistical measure that has also seen widespread adoption in the domain of statistical modeling (Hyndman and Koehler, 2006). The MAE is conceptually simpler and has an easier interpretation than the RMSE: in a time series plot, it is simply the average absolute distance between each actual value and the predicted value at any given time. It is less sensitive to outliers in the predictions than the RMSE, in which outliers tend to increase the score since the errors are squared (Willmott and Matsuura, 2005). The MAE is defined as

\[ \mathrm{MAE} = \frac{1}{n}\sum_{j=1}^{n}\lvert y_j - \hat{y}_j\rvert, \tag{3.7} \]

where ŷ_j is the j-th predicted value, y_j is the j-th actual value, and n is the number of observations.
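To make the rolling-origin procedure concrete, the sketch below runs ten one-step folds with a deliberately simple last-value forecaster standing in for a fitted model; the series, the split point, and the fold count are placeholders, and with a horizon of one the per-fold RMSE and MAE coincide, as noted above.

import numpy as np


def rmse(actual, predicted) -> float:
    return float(np.sqrt(np.mean((np.asarray(actual) - np.asarray(predicted)) ** 2)))


def mae(actual, predicted) -> float:
    return float(np.mean(np.abs(np.asarray(actual) - np.asarray(predicted))))


# Illustrative series of daily late shares (270 observations).
rng = np.random.default_rng(1)
y = np.clip(rng.normal(loc=0.1, scale=0.03, size=270), 0.0, 1.0)

errors = []
for fold in range(10):
    train = y[: 260 + fold]               # training window grows by one per fold
    actual = y[260 + fold : 261 + fold]   # the next observation is the test point
    forecast = train[-1:]                 # placeholder model: repeat the last value
    errors.append(mae(actual, forecast))  # equals rmse(actual, forecast) here

# With a one-step horizon the RMSE and MAE of each fold are identical, so the
# averaged score is reported as a single RMSE/MAE value.
print("cross-validated RMSE/MAE:", float(np.mean(errors)))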
3.5.3 Mean absolute scaled error

The MASE is a statistical measure also discussed by Hyndman and Koehler (2006), who consider it the best available measure of forecasting accuracy. Measures based on relative forecasting errors attempt to remove the scale of the data by putting the forecasts in relation to some other forecasting method used as a benchmark, usually a naive method. The naive forecasting method implemented in this thesis is to simply forecast the value at time t as the actual value at time t − 1; in other words, the next forecasted value is the current measured value. MASE values are also easy to interpret: a value larger than one means the model performs on average worse than the naive method, and conversely a value less than one indicates that the model performs better on average than the naive method. A drawback of relative measures is that they can only be computed when several forecasts are made on the same series. Therefore, it is not possible to use the MASE to measure out-of-sample prediction error when the forecast horizon is only a single sample (Hyndman and Koehler, 2006). When scaling the errors by the in-sample MAE of the naive method, the scaled error is defined as

\[ q_t = \frac{y_t - \hat{y}_t}{\frac{1}{n-1}\sum_{i=2}^{n}\lvert y_i - y_{i-1}\rvert}, \tag{3.8} \]

where the numerator is the forecast error for the given period and the denominator is the MAE of the naive method over the in-sample period. The MASE is then defined as

\[ \mathrm{MASE} = \operatorname{mean}\left(\lvert q_t \rvert\right). \tag{3.9} \]
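A small numerical sketch of the MASE computation is given below; the in-sample series and the forecasts are made-up values used only to show how the scaling in equation (3.8) works.

import numpy as np


def mase(actual, predicted, insample) -> float:
    """Mean absolute scaled error, scaled by the in-sample MAE of the naive method."""
    naive_mae = np.mean(np.abs(np.diff(insample)))  # (1/(n-1)) * sum |y_i - y_{i-1}|
    q = (np.asarray(actual) - np.asarray(predicted)) / naive_mae  # eq. (3.8)
    return float(np.mean(np.abs(q)))                              # eq. (3.9)


insample = np.array([0.10, 0.12, 0.09, 0.15, 0.11, 0.13])  # training observations
actual = np.array([0.14, 0.10, 0.12])                      # test-set values
predicted = np.array([0.12, 0.11, 0.10])                   # model forecasts
print("MASE:", mase(actual, predicted, insample))  # below one beats the naive method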
Figure 3.6: Comparing tools and techniques used by us with what is used by data scientists at Company X.

3.6 Implementing a software engineering framework

To answer research questions 2 and 3, the workflow shown in Figure 3.6 is followed. The first step is to create a framework in which the tools and techniques utilized during the model development phase are connected to each software engineering principle. This framework will be referred to as framework 1. The purpose of using the tools and techniques is to ensure that the software engineering principles discovered during the literature review are followed. The specific way in which these principles are implemented in model development is outlined in the framework, together with the tool or technique by which they are achieved. The second step is to investigate the tools and techniques utilized by the data scientists at Company X. In the same manner as in the previous step, a framework referred to as framework 2 is developed, connecting software engineering principles to tools and techniques used by data scientists at Company X. It is important to develop framework 1 before framework 2, since the tools in framework 1 should be based purely on the tools and techniques found in the literature review and should not be influenced by the current standard practices of data scientists at Company X. Framework 2 is developed by conducting individual interviews with data scientists at Company X. The last step is to compare framework 1 to framework 2 by discussing similarities and differences. Software engineering principles might be fulfilled differently in the two frameworks. The discussion's main purpose is to find gaps between the theory of this thesis and how Company X currently operates. This step provides a discussion of the application possibilities of software engineering principles in a real-world environment. Verifying the application of software engineering principles in real data science projects is lacking in current research, according to Al-Sarayreh et al. (2021).

3.6.1 Understanding the current data science development process

As mentioned in the previous section, the current data science development process at Company X needs to be mapped out to create framework 2. According to Stol and Fitzgerald (2018), interviews are a good method for carrying out field study research. Another common technique is field observations; due to time constraints, field observations are not used in this study.

The purpose of the interviews is to understand the current model development processes of the data science teams at Company X. Firstly, it is studied whether the data scientists at Company X are aware of the eight suggested software engineering principles and, if so, what tools or techniques they use to follow said principles. Furthermore, each interviewee gives a basic description of their background, to nuance the reasoning as to why different data scientists have a different view on software engineering principles or why they use different tools or techniques to achieve them.

The interviews are semi-structured, meaning that the interview is not constrained to pre-determined questions; the purpose of the interview questions is to set a guideline for the interview (Bell et al., 2019). Semi-structured interviews are used because the interviews aim to gather opinions and thoughts on the proposed software engineering principles, and the interviewees are asked to reflect on these by discussing which tools they use to achieve each principle. Opinions are more easily expressed in a semi-structured interview than in a structured interview format (Bell et al., 2019).

The interview subjects were selected by asking all eight employees at Company X with data scientist as their title, and interviews were held with the available subjects. The study chose not to interview any data scientist with high seniority, since their tasks fall more under project management than actual model development. Out of the eight data scientists, six were available for interviews at the time of asking. Since six out of eight data scientists were interviewed at Company X, the chosen sample can be described as saturated for the given context; in other words, all potential interview subjects were asked to participate in an interview.

4 Results

This chapter outlines the results of the thesis. The results of applying ARIMA, BSTS, and GAM in the given context, with model performance statistics, are presented along with the tools and techniques used when applying software engineering principles. Results of the conducted interviews are also presented.

4.1 Time series models

In the following section, a few illustrative time series are plotted. When creating individual time series for specific pairs of region and service type, nine of the time series are found to only have data for the first two to five months. These are omitted from the study, resulting in a total of 23 time series to examine. Every omitted time series is of service type A. The regional variability in the share of late packages for service type B is very low, as shown in Table 4.1; the mean share of late packages is around 0.09 for service type B. The regional variance for service type A is larger, as shown in Table 4.2. The regions DE8, DE9, and DEA have a distinctly higher mean share of late packages than the rest of the regions with service type A.
The regions with a high mean share of late packages are close to each other, located in the north-western parts of Germany.

Table 4.1: Mean share of late packages per region for service type B.
    DE1  0.0901    DE2  0.0851    DE3  0.0751    DE4  0.0885
    DE5  0.0990    DE6  0.0878    DE7  0.0787    DE8  0.0960
    DE9  0.0810    DEA  0.0850    DEB  0.0905    DEC  0.0744
    DED  0.109     DEE  0.0891    DEF  0.0550    DEG  0.0883

Table 4.2: Mean share of late packages per region for service type A.
    DE1  0.122     DE7  0.0400    DE8  0.224     DE9  0.191
    DEA  0.213     DEB  0.0554    DEF  0.0511

The plotted graphs all share the same layout. The graphs are cut off at day 150 and day 270 for illustrative purposes. The models are fitted to the first 260 observations of the various time series; forecasts are then made for the next ten consecutive days. The red line represents the actual data and the blue dashed line represents the fit of the models. The dotted vertical line indicates where the predictions start. Note that the studied data contains values for a full year (365 days), beginning in January 2021 and ending in December 2021. The data is preprocessed by removing Saturdays, meaning that the input data to the models covers around 310 days. Predictions are made from day 260, since the study wanted to avoid making predictions in December, where the data is irregular due to the holiday season. Instead, predictions are made on dates that correspond to the end of October or the beginning of November.

4.1.1 Models in Region DE1 with service type A

For many regions, the typical behavior of the time series is a share of late packages around zero with occasional spikes of late packages, as seen in Figures 4.1, 4.2, and 4.3. For these cases, the ARIMA model does not catch any seasonality or trend, as seen in Figure 4.1, and only fits a straight line based on the mean of the time series to minimize its prediction errors. The chosen ARIMA model is an ARIMA(0,0,0) with a non-zero mean, meaning that the chosen parameters of the model consist of only a constant mean, putting no weight on the autoregressive part or the moving average part.

Figure 4.1: ARIMA model results for region DE1 with service type A.

Neither does the BSTS model catch the spikes, despite not fitting a straight line. This shows that the regression variables have an effect on the model, but as shown in Figure 4.2, the spikes of the observed time series are never captured by the model. The same problem arises when fitting the GAM, which shows a poor fit for the spikes, as seen in Figure 4.3. The GAM does not have any significant variables, where significance in this case is defined as a P-value below 0.05. The RMSE and MAE scores are best for the BSTS model, as seen in Table 4.3. The GAM scored better than the ARIMA model with respect to the MAE measure but worse with respect to the RMSE measure.

Figure 4.2: BSTS model results for region DE1 with service type A.

Figure 4.3: GAM model results for region DE1 with service type A.

Table 4.3: Statistical measurements for region DE1 with service type A.
    Model    RMSE     MAE      MASE
    ARIMA    0.0302   0.0302   inf
    BSTS     0.001    0.001    inf
    GAM      0.034    0.026    inf

4.1.2 Models in region DEA with service type A

Figure 4.4: ARIMA model results for region DEA with service type A.

In a few regions, the share of late packages is consistently high across the whole observed period.
An example of a region that displays this kind of behavior is DEA, as seen in Figures 4.4, 4.5, and 4.7. It is clear that these time series are better behaved for modeling purposes: they display a seasonal component and, in general, more variation in the data. Since these time series contain more variation, the models are capable of fitting this variation more closely. The chosen ARIMA model for this pair of region and service type is an ARIMA(5,1,1). When observing the plot of the ARIMA model (Figure 4.4), one can see that the model captures the seasonality in the data, but the model has a clear upward trend despite the data being differenced once in the chosen model. Therefore, the ARIMA model predicts increasingly larger values further along the time axis.

Figure 4.5: BSTS model results for region DEA with service type A.

The BSTS model obtains the best fit to the time series in this case in terms of MAE, RMSE, and MASE (Table 4.4). The BSTS model also has an upward trend component (Figure 4.6), but in contrast to the ARIMA model, it also combines a dynamic regression component when producing its predictions. This results in predictions that are more reliable and accurate. The seasonal component seems to compensate for the upward direction of the trend component; therefore, the BSTS model is capable of producing predictions that alternate between lower and higher values, in contrast to the ARIMA model, which produces consistently high values in its predictions.

The four different components of the BSTS model that are combined to produce the predictions are displayed in Figure 4.6. As mentioned, the trend component has an upward trend with respect to the time axis. The seasonal component has a periodicity of 6, which corresponds to the six different days of the week that are included in the time series (since Saturdays are omitted). It is dynamic and therefore changes over time: it starts with a rather small amplitude, which then grows with time, similar to the actual time series. The regression component displays the state contribution of the regression part of the fitted model. The auto-regressive component displays an AR(3) process, which the Auto.Ar function has decided upon. This process takes into account the value of the target variable from the three preceding days when producing its contribution to the final predicted value.

Figure 4.6: Components of the fitted BSTS model over region DEA with service type A.

Table 4.4: Statistical measurements for region DEA with service type A.
    Model    RMSE    MAE     MASE
    ARIMA    0.311   0.256   0.612
    BSTS     0.226   0.199   0.477
    GAM      0.283   0.258   0.618

A visualization of the GAM's fit to the time series is displayed in Figure 4.7. As one can observe, the GAM captures a seasonal component of the data together with an upward trend. However, since the seasonal component is static and does not change over time as in the case of the BSTS model, it fails to predict high and low values with the correct amplitude. Instead, it fits a constant seasonal component that results in underwhelming predictions for both high and low values. The additional smoothing functions that are fit to the predictor variables seem to be of little to no help in aiding the GAM to produce accurate predictions. Only the wDay variable and the time variable, representing the seasonal and trend components, were assigned a significant P-value.

Figure 4.7: GAM model results for region DEA with service type A.
4.1.3 Models in Region DE7 with service type B

Figures 4.8, 4.9, and 4.10 illustrate the models created for region DE7 with service type B. As in the case of region DE1 with service type A, the ARIMA model is more or less a straight line. This time, the chosen model is an ARIMA(2,0,0) with a non-zero mean, meaning it contains an auto-regressive part and a mean value. As seen in Figure 4.8, the fitted line is not completely straight and dips at around day 170. Although an ARIMA(2,0,0) model is chosen, its predictions are the worst out of the studied models for this region and service type. The BSTS model, on the other hand, manages to predict a spike at around day 265, as the actual data shows in Figure 4.9. Region DE7 with service type B illustrates that the BSTS model might have the best predictive ability for irregular spikes due to its regression variables. The GAM also performs poorly for the given region and service type, as shown in Figure 4.10. As seen in Table 4.5, the GAM performs the worst for the given region and service type.

Figure 4.8: ARIMA model results for region DE7 with service type B.

Figure 4.9: BSTS model results for region DE7 with service type B.

Figure 4.10: GAM model results for region DE7 with service type B.

Table 4.5: Statistical measurements for region DE7 with service type B.
    Model    RMSE    MAE     MASE
    ARIMA    0.018   0.014   0.715
    BSTS     0.013   0.010   0.499
    GAM      0.020   0.016   0.816

4.2 Model evaluation

The three different models are evaluated, as previously stated, by comparing the three statistical measurements RMSE, MAE, and MASE. The results of the fitted models and their accuracy are summarized in Table 4.6. These tables show the statistical measures in