Utilizing Computer Vision for the Analysis of Manufacturing Processes Evaluating Automation Possibilities for the AviX Software Suite Bachelor’s thesis in Computer science and engineering Oussama Anadani Herman Bergström Gustav Fåhraeus Oscar Helgesson Simon Svensson Hanna Tärnåsen Department of Computer Science and Engineering CHALMERS UNIVERSITY OF TECHNOLOGY UNIVERSITY OF GOTHENBURG Gothenburg, Sweden 2021 Bachelor’s thesis 2021 Utilizing Computer Vision for the Analysis of Manufacturing Processes Evaluating Automation Possibilities for the AviX Software Suite Oussama Anadani Herman Bergström Gustav Fåhraeus Oscar Helgesson Simon Svensson Hanna Tärnåsen Department of Computer Science and Engineering Chalmers University of Technology University of Gothenburg Gothenburg, Sweden 2021 Utilizing Computer Vision for the Analysis of Manufacturing Processes Evaluating Automation Possibilities for the AviX Software Suite Oussama Anadani Herman Bergström Gustav Fåhraeus Oscar Helgesson Si- mon Svensson Hanna Tärnåsen © Oussama Anadani, Herman Bergström, Gustav Fåhraeus, Oscar Helgesson, Si- mon Svensson, Hanna Tärnåsen 2021. Supervisor: Pedro Petersen Moura Trancoso, Department of Computer Science and Engineering Advisor: Oskar Ljung, Solme AB Examiner: Sven Knutsson, Department of Computer Science and Engineering Bachelor’s Thesis 2021 Department of Computer Science and Engineering Chalmers University of Technology and University of Gothenburg SE-412 96 Gothenburg Telephone +46 31 772 1000 Cover: Image modified using the created applications face blur and pose estimation functionality. Source adapted from [1]. Typeset in LATEX Gothenburg, Sweden 2021 iii Utilizing Computer Vision for the Analysis of Manufacturing Processes Evaluating Automation Possibilities for the AviX Software Suite Oussama Anadani, Herman Bergström, Gustav Fåhraeus, Oscar Helgesson, Si- mon Svensson, Hanna Tärnåsen Department of Computer Science and Engineering Chalmers University of Technology and University of Gothenburg Abstract Throughout the manufacturing industry, video recordings are used to help stan- dardize work and develop training material for companies. Solme AB develops a software suite named AviX which aids in the analysis of these recordings. This re- port aims to evaluate how computer vision technology could be utilized to increase the functionality of the AviX suite. Furthermore, the report will evaluate how the technology could be used to automate analysis currently performed manually in the program. The evaluated features are face blur, tool highlighting, ergonomic risk de- tection, and footstep counting. A software platform is developed in Java, primarily with the use of OpenCV, to serve as a proof-of-concept for Solme. To support the flexibility of changing the set of enabled features, the application was constructed modularly and the features were implemented independently. The thesis concludes that there is potential to extend the functionality of the AviX suite by utilizing com- puter vision. Automated face blurring has been achieved with a considerable success rate, increasing the privacy of people appearing in the video recordings. Moreover, the automation of ergonomic risk detection showed promising results which indicate that manually performed analysis can indeed be automated. Index Terms—Artificial intelligence (AI), computer vision, ergonomic risk, face blur, face detection, neural networks, object detection, pose estimation, step counting, tool highlighting. 
iv Sammanfattning Inom tillverkningsindustrin används videoinspelningar frekvent för att standardisera arbetsuppgifter samt för att fungera som utbildningsmaterial internt inom företag. Solme AB utvecklar en mjukvaruplatform, vid namn AviX, som syftar till att un- derlätta analysen av dessa inspelningar. Målet av denna rapport är att evaluera hur datorseende kan användas för att utöka funktionaliteten hos AviX. Vidare kommer rapporten evaluera hur teknologin kan användas för att automatisera analys som för tillfället utförs manuellt i programmet. Funktionaliteten som utvärderas är ansikts- blurrning, verktygsmarkering, detektering av ergonomisk risk, samt stegräkning. En mjukvaruplatform har utvecklats i Java, primärt med hjälp av OpenCV, för att an- vändas som ett koncepttest åt Solme. För att få flexibilitet och göra det lätt att välja vilken funktionalitet som ska vara aktiverad har applikationen byggts modulärt och de olika funktionerna har implementerats oberoende av varandra. Rapporten drar slutsatsen att det finns potential att utöka funktionaliteten hos AviX genom att ut- nyttja datorseende. Automatisk blurrning av ansikten har uppnåtts med betydande framgång, vilket ökar integriteten hos personerna som visas i videoinspelningarna. Dessutom visade detekteringen av ergonomisk risk lovande resultat vilket indikerar att manuellt utförd analys kan automatiseras. Nyckelord—Ansiktsblurrning, ansiktsigenkänning, artificiell intelligens (AI), datorseende, ergonomisk riskbedömning, neurala nätverk, objektdetektering, stegräkning, verk- tygsmarkering. v Acknowledgement We would first and foremost like to thank our supervisor Pedro Petersen Moura Trancoso. The thesis would not have been possible without your positive energy, ideas, and constant will to help push the project forward. Furthermore, we would like to thank Oskar Ljung for his continued support through- out the project. Your constant interest and cooperative spirit ensured the project stayed on the right path. We also want to thank Uma Shankar Subramani and the rest of the AviX team for their guidance in the construction of the application. vi Contents 1 Introduction 1 1.1 Purpose . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.3 Ethical Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.4 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2 Background 5 2.1 Computer Vision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.1.1 Artificial Intelligence (AI) . . . . . . . . . . . . . . . . . . . . 5 2.1.2 Machine Learning (ML) . . . . . . . . . . . . . . . . . . . . . 6 2.1.3 Neural Networks (NN) . . . . . . . . . . . . . . . . . . . . . . 7 2.2 The AviX Software Suite . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 3 Features for AviX 9 3.1 Face blurring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 3.2 Tool Highlighting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 3.3 Ergonomic Risk Detection . . . . . . . . . . . . . . . . . . . . . . . . 11 3.4 Footstep Counting . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 4 Method 13 4.1 Work Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 4.2 Utilized Libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 4.3 Software Evaluation . . . . . . . . . . . . . . . . . . 
. . . . . . . . . . 14 4.3.1 Face Blur . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 4.3.2 Ergonomic Risk Detection . . . . . . . . . . . . . . . . . . . . 16 5 Implementation 17 5.1 Program Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 5.2 Evaluators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 5.2.1 Face Detection . . . . . . . . . . . . . . . . . . . . . . . . . . 19 5.2.2 Object Detection . . . . . . . . . . . . . . . . . . . . . . . . . 20 5.2.3 Pose Detection . . . . . . . . . . . . . . . . . . . . . . . . . . 20 5.3 Image processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 5.4 Feature Implementation . . . . . . . . . . . . . . . . . . . . . . . . . 22 5.4.1 Face Blur . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 5.4.2 Tool Highlighting . . . . . . . . . . . . . . . . . . . . . . . . . 23 vii Contents 5.4.3 Ergonomic Risk Detection . . . . . . . . . . . . . . . . . . . . 24 5.4.4 Footstep Counting . . . . . . . . . . . . . . . . . . . . . . . . 26 6 Results and Discussion 27 6.1 Face Blur . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 6.2 Tool Highlighting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 6.3 Ergonomic Risk Detection . . . . . . . . . . . . . . . . . . . . . . . . 29 6.4 Footstep counting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 7 Conclusions and Future Work 32 7.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 7.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 Bibliography 34 A Appendix 1 I A.1 Library Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . I A.1.1 OpenCV . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I A.1.2 TensorFlow . . . . . . . . . . . . . . . . . . . . . . . . . . . . I viii 1 Introduction Video recordings are used in the manufacturing industry to help standardize work and develop training material for companies [2]. By analyzing these recordings, com- panies can also restructure their workstations to provide a more efficient workflow, as well as to reduce poor ergonomic conditions for the workers. Solme AB [3] is a software company specialized in video analysis of industrial pro- cesses. Their product, a software suite named AviX [4], offers tools for ergonomic analysis and optimizations of manufacturing processes. AviX is based on recording the work process, timing the tasks, and classifying their utility. When the process is visualized, several modules enable evaluation and optimization. These modules pro- vide tools for line balancing and execution optimization [5], failure mode and effects analysis (FMEA) [6], ergonomic analysis, single-minute exchange of dies (SMED) [7] analysis and product design. 1.1 Purpose Recently, Solme has been approached by customers who have had concerns regarding the privacy of their employees. The company has as such begun exploring different features and solutions that would make the manufacturing workers more comfort- able with the video analysis. One of the suggested solutions is to implement a feature that would enable users of AviX to blur the faces of people who appear in the recordings by utilizing computer vision. When discussing the possibility to use computer vision technology to blur faces, Solme also showed interest in exploring other ways the technology could be utilized in their software suite. 
As a result, they want to investigate if it would be possible to automate the analysis currently performed manually by users of AviX. The purpose of this thesis is to explore how computer vision can be used to im- prove upon or create new, features for the AviX software suite. The primary feature will be the blurring of faces, but together with Solme an additional set of analytical features to examine has been decided upon. These additional features are: 1) tool highlighting; 2) ergonomic risk detection; and 3) footstep counting. The functional- ity of these features will be explained in detail in Chapter 3. The goal is to produce a prototype that can serve as a proof-of-concept for Solme from which they can gain insight into the potentials and limitations of utilizing computer vision to offer 1 1. Introduction these new features. As different analysis features will be explored and implemented, the aim is to build a modular platform that allows the user to easily change which analysis is to be executed. 1.2 Problem Description The problem posed in this thesis is to produce a proof-of-concept platform using openly available AI models to implement features desired by Solme. The main re- quirement for the platform is that it should be created with Java. Furthermore, it should utilize openly available resources in the form of code and classes, hence- forth referred to as open-source libraries. The features that are to be implemented are: • Face Blurring - The ability to blur faces that appear in the video. This will be the primary feature. • Tool Highlighting - Highlight tools used in the video for educational pur- poses. • Ergonomic Risk Detection - Timestamp moments in the video containing non-ergonomic motions. • Footstep Counting - Count the number of steps taken in a video. The platform should be able to take a video file as input and output a modified ver- sion of the file. Furthermore, it should output the results of the analysis performed in an appropriate document format. This could for example mean outputting a video with faces blurred and a document containing timestamps for non-ergonomic motions. Additionally, the user should be able to select which analysis is to be per- formed. As the focus of this project is not to build a graphical user interface (GUI), command-line arguments will be used. The proof-of-concept is supposed to work as a standalone application. The goal is not to integrate any functionality into the AviX suite at this point, but rather build a platform in Java that could be used as a reference if Solme wishes to pursue these features. 1.3 Ethical Evaluation Computer vision applications today can capture exceedingly detailed and personal- ized information. This has given rise to ethical concerns as the technology can be used for different purposes. For example, developers of computer vision have collab- orated with authorities to monitor people of a specific ethnic group [8]. Furthermore, there is also the issue of people using computer vision to generate synthetic media where the person appearing in the video is manipulated to look like someone else, commonly called ’deepfakes’ [9]. Even though this work is vastly different from the examples above, there is un- 2 1. Introduction derstandably a negative stigma attached to computer vision systems [10]. People do not generally like being surveilled, and it should be fully within their rights to choose not to participate in the videos. 
In the case of this project, the people appearing on video are consenting adults who have been informed of what the footage will be used for. Furthermore, different use cases of computer vision have different ethical implications. As such, the implications of each feature that is to be implemented will be discussed in the following sections. The blurring of faces is a feature with a large social aspect. According to the recent EU General Data Protection Regulation (GDPR), the face is a person’s most visible and primary form of identification [11]. As such, the face blurring of manufacturing workers can be seen as a measure to protect their privacy. However, it should be noted that face blurring does not ensure total de-identification. Identifying indi- viduals through other personal traits such as their bodily physical appearance and clothing is still possible. In addition, there are outcomes of the face blurring feature which might not be in the interest of the workers. Firstly, studies show that blurring faces not only de- identify them but also de-humanizes them [12]. This can subsequently decrease the watchers’ ability to relate to the participants of the video. Secondly, it is important to recognize that Solme’s interest in face blurring is not inherently to protect the privacy of manufacturing workers. Instead, the goal is to make these workers feel more comfortable with being recorded by offering them a sense of privacy. In contrast, the tool highlighting feature is different in that it does not analyze any individual per se. It only looks for inanimate objects, not people. It could be argued that the feature is in the interest of the employees as they are presented with a clearer learning experience. However, even if it is not the intended effect of the feature, it might work as an additional form of surveillance. The feature could po- tentially identify an object that a person is not supposed to be using. For example, it could identify a phone being used by an employee, even though the company has a strict no-phone policy. This could in turn give the employees the feeling that they are being monitored. Lastly, it is hard to identify problematic scenarios for the ergonomic risk detec- tion and step counting features. The ergonomic analysis feature is meant to benefit both the employees and the employers. Working extended periods at an assembly line that has poor ergonomic conditions can lead to long-term injuries [13]. This is naturally undesirable for the employee but can also be expensive for the employer. With this in mind, it is easy to see how identifying which stations have poor er- gonomic conditions is of great interest. In the case of the step counting feature, a positive aspect could be ensuring that the manufacturing process allows the workers to walk an appropriate distance each day. It may seem obvious that walking too much is taxing on the body, but so is walking too little [14]. 3 1. Introduction 1.4 Thesis Outline After this introduction, the thesis will start by offering background knowledge to important areas in Chapter 2. In this chapter, an overview of what computer vision is will be presented, as well as a quick introduction to the AviX software suite. Fur- thermore, the section will cover work related to this project. In Chapter 3 the features will be discussed in detail. The functionality of each feature will be defined, while also offering explanations as to how AviX is to benefit from them. Next, in Chapter 4 the work process will be described. 
This will include descrip- tions of how the libraries were evaluated, how features were explored as well as how testing was performed. A rundown of the software created will be presented in Chapter 5. The section will cover the program structure as well as the implementations of each feature. The program flow when processing a video will be explained to offer an overview of the prototype. The results and discussion presented in Chapter 6 will firstly focus on how well the proof-of-concept works as a whole. It will then subsequently go over each fea- ture and discuss its performance. Lastly, in Chapter 7 the report will offer some conclusions and ideas for potential future work. Using the results from the previous chapter, the section will summa- rize the issues that Solme would encounter if the software of this kind was to be implemented in their AviX suite. Furthermore, it will offer closing thoughts as to which features could work well given enough effort, and which features may prove more difficult to implement. 4 2 Background This section aims to provide the reader with the essential background knowledge needed to understand the content of the thesis. Firstly, key concepts surrounding computer vision used within the project will be introduced and explained to aid the reading comprehension during the technology- focused sections of the thesis. Furthermore, a brief introduction of AviX will be presented so that the reader can understand the connection between the project and the software suite. Lastly, a brief presentation of previous solutions to similar problems will be shown in order to give the reader an overview of previous work performed in the area. 2.1 Computer Vision Computer vision refers to the use of different methods within the realm of artifi- cial intelligence to extract information from visible media such as digital images or video. To classify as computer vision, the AI needs to not only extract information but also do something with the information, such as recommend or take an action [15]. Computer vision as a technology has been around for decades. The earliest im- plementations of the technology (the late 1960s) were attempts to mimic the human visual system [16]. This resulted in the technology that is today called ’image tag- ging’, i.e. computers being able to categorize images based on what is predicted to appear in them. 2.1.1 Artificial Intelligence (AI) There is no generally accepted definition of what artificial intelligence, or AI, means. In this project, AI is interpreted as the entire set of tools and technology that can be used or interpreted in a way that bears a resemblance to human intelligence [17]. This project primarily refers to the detection of faces in videos and the estimation of human poses. AI can as such be seen as a term that encapsulates many other techniques and methods [18]. One such method being machine learning. 5 2. Background Fig. 1. The relationship between artificial intelligence, machine learning, and deep learning. Source: Adapted from [17]. 2.1.2 Machine Learning (ML) Machine learning is generally described as mathematical methods constructed to make predictions and optimization through the use of experience [19]. The term experience is meant to symbolize data from which a method can learn how to do a specific task and the quality and quantity of the incoming data points. More concretely, in this project, ML refers to the way the methods used in dif- ferent libraries are constructed. 
Different libraries used in aspects of this project contain, or can use, different networks and algorithms to achieve a specific goal. For example, the networks used to detect the position of faces in images are trained through machine learning. Algorithms trained through ML are the key to perform- ing the computer vision tasks this project requires [20]. Different algorithms that have gone through different training regimes differ in their performance and efficiency. When optimizing an algorithm, it is provided a large amount of data. In the context of face detection, an algorithm would be provided with a large number of images containing faces as well as labels stating if and where a face is located. The algorithm uses these images and labels to learn how to make estimated guesses. Algorithms or networks optimized under certain conditions are therefore also more accurate when predicting under these same conditions. 6 2. Background 2.1.3 Neural Networks (NN) A subpart of ML is neural networks. Neural networks aim to imitate the human brain through a collection of algorithms [21]. A neural network works by taking an input, sending it forward to an arbitrary number of hidden layers (algorithms with constants and coefficients that are optimized when training), before mapping it to outputs representing different predictions. In the end, a neural network output can range from being binary, such as yes or no, to a multitude of values mapping for example different colors that can be identified in an image [22]. A neural network that contains two or more hidden layers is usually referred to as a deep neural net- work (DNN)[21]. Fig. 2. A neural network featuring a three dimensional input layer, a hidden layer and two types of output. Source: Adapted from [23] All this combined allows for inputs to be sent through a network with multiple different paths to finally result in some sort of output that can be used to reach conclusions or make predictions. 2.2 The AviX Software Suite AviX is a software suite, created by Solme AB, that is used for optimization and analysis of production and design processes. AviX consists of modules, with different 7 2. Background modules for different purposes. For example, there is ’AviX Method’ and ’AviX Ergo’, the first being their tool for method and time studies that come with a built- in media player. The latter, ’AviX Ergo’, is a module that through video processing technology analyzes the ergonomics of the workplace [24]. 2.3 Related Work In this section, related work to the work made in this thesis will be presented. Firstly, work related to the face blurring feature will be presented. Secondly, related work to the ergonomic risk detection will be highlighted. Thereafter, related work to the tool highlighting and footstep counting features respectively will be introduced. Face detection is used in many different applications for a multitude of reasons. There is the use case where the face detection is subsequently followed by face blur- ring. This is primarily performed because of privacy concerns of people appearing in images or videos, which often is a necessity when using street-level images. The most notable example of this is Google Street View [25, 26]. Furthermore, computer vision can also be utilized for full-body de-identification, as a continuance to face- blurring [27]. This is done in a way that retains the human features while allowing for increased privacy of individuals. 
Previously, software has been developed that aims to assist in the process of as- sessing ergonomic risk. For example, Venkatesulu and Koundinya implemented a solution during their master thesis project at Chalmers where they used wearable sensors to assess ergonomic risk [28]. The sensors (accelerometers) were placed on different segments of a person, and as this person was going about his or her work the sensors measured the movement of the body. Using this data, information such as the angles between certain joints was extrapolated and used to calculate a score using the Rapid Entire Body Assessment (REBA) method. REBA is a tool that uses generally applicable ergonomic standards which results in it being used in a wide variety of working environments to quantify ergonomic risk [29]. Regarding tool highlighting, computer vision can be used for more than just de- tecting which tools that are being used, it can also be used to evaluate the wear of a cutting tool [30]. Using the tool in different ways and subsequently evaluating the wear can lead to finding optimal usage methods for minimal wear. That is favorable from an economical perspective. In the case of footstep counting, historically some form of pedometer has been used to keep track of a person’s footsteps. One paper that more closely relates to this project presents a solution where footsteps were being counted by analyzing footage of a person doing agility training [31]. No sensors were used and it had a 97% ac- curacy while processing in real-time. The limitations of their solution are that the camera has to be stationary, and the movement is constricted to a pre-configured square where a single person’s movements are predictable. 8 3 Features for AviX This chapter will explain the features in more detail. It will be going deeper into what the features are and why they are useful for AviX. How these features ended up being implemented in the program will be explained in Chapter 5. 3.1 Face blurring The face blurring feature edits the frames in a video to blur faces. An example is showcased in Fig. 3. The means of the feature is the anonymizing of facial features. However, as mentioned in Section 1.3, more parts of a human can be used to iden- tify the human such as clothes, posture, and body. Therefore, it is not complete anonymization. Fig. 3. Face detected with OpenCV’s Deep Neural Network (DNN) module. Source: Adapted from [32] 9 3. Features for AviX Face blurring can be divided into two separate steps; where the first step is best described as the use of computer vision for face detection, or, to identify human faces in digital images [33]. The second step involves blurring the identified face in the image. Solme desires face blurring to be implemented into AviX due to customer inter- est. Their customers are the companies where the filming takes place, and those companies’ workers are the people being filmed in the videos for AviX. Hence, the blurring is being added for them to feel more comfortable being filmed and therefore be more inclined to participate in the videos. 3.2 Tool Highlighting The term tool highlighting is a specific case of object detection that has tools and machines as identifiable classes. The general purpose of object detection is to auto- matically locate and categorize objects of interest in videos. Different objects can be chosen to be categorized based on which objects are relevant to the intended use case. An example of object detection is showcased in Fig. 4. Fig. 4. 
Objects detected with OpenCV’s Deep Neural Network (DNN) module. Source: [34]. Tool highlighting can be implemented with DNNs trained on tools and machines. The coordinates of objects as well as their names will be output from the DNN, which can be used to automatically insert the bounding box and the name in each 10 3. Features for AviX image of the video. AviX is among other things, used for educational purposes. Therefore, highlighting what tools and machines that are being used could aid the learning process for em- ployees. To prevent unnecessary screen cluttering, it is important that this feature can be turned off when the identification of tools is not of interest. 3.3 Ergonomic Risk Detection The feature aims to detect when subjects on film are in positions of especially high ergonomic risk. This is achieved through the use of key points to make wider as- sumptions about when a person appearing in the video could potentially be under higher risk for injury. With the help of machine learning, it is possible to extract 2D coordinates of cer- tain anatomical key points of people appearing in a video frame. This technology is usually referred to as pose estimation. Examples of such key points are the shoul- der, elbow, and eyes as can be seen in Fig. 5. Thereafter, these joints can be used to determine if a person is in a high risk pose. The occasions where such a pose is detected will then be timestamped with the purpose of being analyzed further. Fig. 5. Pose detected by OpenPose. Source: Adapted from [35]. 11 3. Features for AviX In the AviX Software suite, there is a module called ’AviX Ergo’ which aims to assist Solme’s customers with ergonomic risk assessments by performing video analysis. This process requires a person to look through all the video footage to detect high- risk positions. The goal of this feature is to aid that person by using AI to analyze the video, creating timestamps at potentially harmful positions, thus making the process of assessing ergonomic risk quicker. 3.4 Footstep Counting Footstep counting refers to automatically counting the number of footsteps taken by a person in a video. As a metric for measuring physical activity, counting steps has many benefits: 1) it is intuitive; 2) objective; and 3) they constitute a basic unit of human ambulatory activity [36]. One way that the feature can be implemented is by using the same solution as the ergonomic risk detection feature. With a 2D skeleton that is generated frame by frame, a primitive footstep counter can work by analyzing ankle or foot movement. The main use case for footstep counting is as a metric of efficiency. Since walking does not directly contribute to parts of the manufacturing process being assembled, it is desirable to minimize it. Walking is at times a necessary task to perform when parts need to be moved. However, the parts could be stationed closer together making the walking distance shorter. Calculating the number of steps enables the possibility to evaluate different methods of organizing the manufacturing process to find the optimal one. On the other hand, the feature could also be used for ergonomic purposes. As mentioned in Section 1.3, walking too little can be taxing on the body. By analyzing the results of this feature, companies can ensure that workers take a sufficient amount of steps when working on a station. 12 4 Method This chapter describes the methodology used to obtain the project results. 
Firstly, the overall work process will be presented, followed by the methodology for select- ing what libraries to use and which were chosen. Lastly, a summary of how the evaluation of the software was conducted is given. 4.1 Work Process The work was performed in subgroups, each with a different feature in focus, gen- erally following the steps below: 1. Research a feature to gain an understanding of how it works and how it could be useful. 2. Research how the feature could be implemented in a way that follows the requirements. 3. Create an implementation of the feature. 4. Test and evaluate different AI models of the feature and explore what changes and additions could be made to the software to yield better results. To allow for this distributed work style, a common interface for the features was designed. This led to the subgroups not having to worry about each other as long as the code was implemented towards this interface. Furthermore, the common interface supported the modularity of the software, allowing features to work in dif- ferent configurations and the implementation being decoupled from specific feature implementations. 4.2 Utilized Libraries Finding suitable libraries was one of the first tasks performed. The libraries needed to satisfy the requirements in the problem description, such as being compatible with Java and aid in the implementation of desired features. During this process, other factors were taken into accounts, such as costs, accessibility, extensibility, and compatibility with other libraries. Using the requirements set out, OpenCV (see Appendix A.1.1) was the primary library chosen and used for the proof-of-concept. OpenCV was chosen primarily as it is a well-rounded open-source computer vision library that supports Java and 13 4. Method offers extensibility to multiple features. The support of many features is beneficial as it eases the implementation. The other chosen library was TensorFlow (see Appendix A.1.2). TensorFlow is an ML-oriented library that offers a wide range of functionality, ranging from AI model creation to running existing models. In this project, it was specifically used to implement the ergonomic risk detection and footstep counting features. However, the library has limited Java support and was primarily selected as it has prominent examples of pose estimation implementations. 4.3 Software Evaluation The purpose of the software evaluation during the project has been twofold: 1. Comparing the quality of the feature implementations. 2. Comparing the quality of different computer vision libraries and services. The evaluation has been performed in parallel with the feature implementations. Solme insisted on the exploration of multiple implementation routes before settling on the one deemed preferable. These different routes took the form of using differ- ent libraries and APIs in separate implementations of the same feature. They also took the form of using different machine learning models in an otherwise identical implementation. As it became more clear which the most suitable library was, the evaluation be- came more detailed and implementation-specific within that library. This testing served to evaluate if the additions and/or changes being made to the application improved the overall quality. It is however important to note that the type of evaluation performed varies from feature to feature. While detailed tests have been performed, some features have only been evaluated at a higher level. 
The main reason for this is that evaluations are being performed to gain an understanding of the performance of different fea- tures. This means that the evaluation results should offer deeper insight into which parts of the program work well and which parts do not. Evaluation that does not offer any such new information is deemed redundant. Not all features received an in-depth evaluation. Rather, only the face blur and ergonomic risk detection features did. The method used when evaluating these two in detail will be described in the coming sections. Tool highlighting and step count- ing were only evaluated at a higher level. This is due to the features having more apparent performance issues, rendering in-depth evaluation unnecessary. 14 4. Method 4.3.1 Face Blur The evaluation of the face blur feature was performed manually. This was because there is no available data set containing faces under the same conditions as the typ- ical video provided by Solme. Because of this, an approach in which the feature was evaluated based on how well it could detect faces in a pre-existing data set could not be chosen. Instead, a manual frame-by-frame approach was chosen to evaluate the quality of the face blur feature under appropriate conditions. First, a suitable ten-second clip from an AviX video was chosen. Different clips have been tested for different purposes which will be discussed shortly. After the clip is selected, it is viewed frame-by-frame, for every frame noting how many present faces should be detected. The video is then processed by the face blur implementa- tion in question. Lastly, the output video is analyzed frame-by-frame again, noting which faces were detected correctly. When doing this, the results were saved and an accurately detected face was de- noted as a true positive (TP), a non-face that was detected as a face was denoted as a false positive (FP), and a face that was not detected was categorized as a false negative (FN). When it comes to the different clips selected for testing, sequences are chosen to be representative of both good and poor face detection conditions. Good face de- tection conditions are defined as follows: • The people in the video are facing the camera. • The people in the video are standing relatively still. • The camera is stable and does not move around noticeably much. Subsequently, poor conditions are defined as: • The people in the video are facing away from the camera (while still being identifiable). • The people in the video are moving around rapidly. • The faces are partially covered. • The camera is shaky and moves around. Face detection was evaluated on multiple occasions throughout the project, as it is Solme’s highest requested feature. It was first evaluated to compare different li- braries. During this process, simple face detection implementations were made using each library. The implementations were then evaluated on the same clip of average quality to gain an overview of how well the different libraries performed. At the later stages of the project, face detection evaluations were performed to compare different DNN models and see how well the proof-of-concept performed under different conditions. 15 4. Method 4.3.2 Ergonomic Risk Detection The evaluation of the ergonomic risk detection took two different forms: one with controlled and chosen footage and another where the implementation was evaluated on footage provided by Solme’s customers. 
In both cases, the questions posed were: • How accurate is the time stamping of the movements? • Are there any false positives (FP) and/or false negatives (FN) occurring? The controlled footage was filmed using webcams and consisted of one group mem- ber going through a "choreography" made out of all the movements that should trigger a detection. The videos provided by Solme’s customers represented typical footage that AviX would analyze. The feature was opted to be tested this way to see how it performed under the best conditions that could be set up, as well as more typical footage. It is of interest to evaluate whether the detections are triggered correctly as right or left regardless if the person is filmed from the front or from behind. To evaluate this, the video and timestamps were compared to ensure the accuracy of the detec- tions. Factors such as lighting and positioning need to be taken into account when discussing the results generated from the tests. During the evaluations, some situations require human evaluation to determine if the timestamp of a detection should be considered correct. For this, the acceptable time range was chosen to be one second. This means that the movement has room to be somewhat misplaced. There is also subjectivity to deal with when it comes to what is considered a stretch away from the body. 16 5 Implementation This section will give an overview of the implementation of the program. An abstract description of the structure will first be given to offer a general understanding of the software. The implementation of AI models and their corresponding features will then be explained in greater detail. 5.1 Program Structure The program analyses a given input video and outputs a modified version of that video. The output will vary depending on which features have been enabled. Addi- tionally, the program outputs a JSON document containing eventual timestamps of interest. During this process, the application iterates over every frame of the video, processing one frame at a time. The analysis of a frame is performed by passing it through the enabled evaluators. Exactly what an evaluator is and what the different versions are will be explained in Section 5.2. For now, the evaluators can be sum- marized as the classes in charge of running the images through the neural networks and interpreting the results. Although each frame is individually sent to the evaluators, there is still informa- tion to be gained by comparing frames one to another. For example, a face being detected in one frame implies a higher chance of the next frame containing a face in the close-by area. To cater for this need, the program is constructed using buffers, which save previous frames as well as previous results generated by the evaluators. These buffers are limited in size and when they fill up different optimization algo- rithms can be utilized to improve the results. These algorithms will be explained in detail in Section 5.4. A flowchart of the abstract program structure which has been provided can be seen in Fig. 6. 17 5. Implementation Fig. 6. A simplified version of the program flow. 5.2 Evaluators The evaluators are in charge of analyzing a given frame by passing it through dif- ferent neural networks. After the network has finished processing the frame, the evaluators interpret the output and each one generates a detection result. The data contained within a detection result varies from evaluator to evaluator. 
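To make the evaluator abstraction more concrete, the following is a minimal Java sketch of how the pieces described here (and named in Fig. 7) could fit together. The type names OpenCVEvaluator, DetectionResult, and AbstractResultBuffer are taken from the actual design, but the method signatures, the use of generics, and the buffer eviction policy are assumptions made purely for illustration.

import org.opencv.core.Mat;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Marker interface for the data produced by an evaluator for a single frame.
interface DetectionResult {
    int frameIndex();
}

// Every evaluator runs one frame through its neural network and interprets
// the raw output as a typed detection result.
interface OpenCVEvaluator<R extends DetectionResult> {
    R evaluate(Mat frame, int frameIndex);
}

// Results are kept in a bounded buffer so that later frames can be compared
// against earlier ones (see Section 5.1); the oldest entry is dropped first.
abstract class AbstractResultBuffer<R extends DetectionResult> {
    private final int capacity;
    private final Deque<R> results = new ArrayDeque<>();

    protected AbstractResultBuffer(int capacity) {
        this.capacity = capacity;
    }

    public void add(R result) {
        if (results.size() == capacity) {
            results.removeFirst();
        }
        results.addLast(result);
    }

    public List<R> snapshot() {
        return List.copyOf(results);
    }
}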
The detection result is subsequently added to that evaluator’s result buffer. There are three different evaluators, each has its own detection result and each has its own buffer. The evaluators are, Face Detection, Object Detection, and Pose Detection. The relationship between evaluators, detection results, and result buffers can be seen in Fig. 7. 18 5. Implementation Fig. 7. A simplified UML displaying the relationship between evaluators and their interfaces. The evaluator structure displayed in Fig. 7 enables extension by making it simple to add further computer vision analysis to the project. This would be accomplished by firstly making a new class, which implements the ”OpenCVEvaluator” interface, to handle the new AI model. Secondly, one would create a class that implements the ”DetectionResult” interface to represent the result of the evaluator. Lastly, the developer would create a class that extends the ”AbstractResultBuffer” class to store the detection results created by the evaluator. 5.2.1 Face Detection The face detection evaluator is used to detect faces in a given image. When the evaluator is initialized it loads the DNN model which is to be utilized. The model is interchangeable with models of similar input and output formats. When building this proof-of-concept, two different models were tested: ResNet-10 [37] and Single Shot Scale-invariant Face Detector (S3FD) [38]. These two models were chosen to be tested because one is much faster than the other, which makes for more nuanced results. After the images have been sent through the neural network, the output is processed. The detections generated by the model contain information about the confidence of the detection, as well as the left, right, top, and bottom coordinates. A threshold 19 5. Implementation is defined and used to filter out detections with too low confidence. The detections with sufficient confidence are then added to the detection result. The detection result instantiated by this evaluator is of the class "FaceDetectionRe- sult". It consists of a list of rectangles, each rectangle representing a detected face. Every rectangle has a top-left coordinate, as well as a width and a height. These instances are then added to the face detection result buffer. 5.2.2 Object Detection The object detection evaluator also uses DNN models. The DNN that was imple- mented is called MobileNetSSD [39, 40], which detects 21 different common objects such as animals and vehicles. It has an output format similar to the DNN of the face detection evaluator. That is, a rectangle around each detected object as well as the detection confidence. How- ever, this model also outputs the name of the object that is detected. The detection result is of the class "ObjectDetectionResult". This result contains rectangles, as well as the names of the detected objects. 5.2.3 Pose Detection The pose detection evaluator makes assumptions about the pose of a person in an image based on the spatial locations of anatomical key points. The model used for generating these key points is the neural network OpenPose [41]. OpenPose is an open-source system for multi-person 2D pose estimation. TensorFlow is utilized to run an image through the OpenPose model. The out- put of the model is the coordinates for the detected key points. These key points are then used to build skeletons that are encapsulated by the ”Human” class. As the joints are also labeled, they can be used to build limbs for the skeletons. 
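As a sketch of how labeled key points might be turned into such skeletons, the snippet below groups the points of one person into a Human object and only derives a limb when both of its end points were found. The Human class name comes from the text above; the Joint labels, the KeyPoint type, and the confidence filtering are illustrative assumptions rather than the project's actual code.

import java.util.EnumMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

enum Joint {
    NOSE, NECK,
    RIGHT_SHOULDER, RIGHT_ELBOW, RIGHT_WRIST,
    LEFT_SHOULDER, LEFT_ELBOW, LEFT_WRIST,
    RIGHT_KNEE, RIGHT_ANKLE, LEFT_KNEE, LEFT_ANKLE
}

// One detected anatomical key point with its 2D position and model confidence.
final class KeyPoint {
    final Joint joint;
    final double x, y, confidence;

    KeyPoint(Joint joint, double x, double y, double confidence) {
        this.joint = joint;
        this.x = x;
        this.y = y;
        this.confidence = confidence;
    }
}

final class Human {
    private final Map<Joint, KeyPoint> joints = new EnumMap<>(Joint.class);

    // Keep only the key points the pose model is reasonably confident about.
    Human(List<KeyPoint> detectedPoints, double minConfidence) {
        for (KeyPoint p : detectedPoints) {
            if (p.confidence >= minConfidence) {
                joints.put(p.joint, p);
            }
        }
    }

    Optional<KeyPoint> joint(Joint j) {
        return Optional.ofNullable(joints.get(j));
    }

    // A limb is only formed when both end points were detected, mirroring the
    // observation in Section 6.3 that body parts outside the frame simply
    // produce no limb.
    Optional<KeyPoint[]> limb(Joint from, Joint to) {
        KeyPoint a = joints.get(from);
        KeyPoint b = joints.get(to);
        if (a == null || b == null) {
            return Optional.empty();
        }
        return Optional.of(new KeyPoint[] { a, b });
    }
}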
Lastly, a detection result of the class ”PoseDetectionResult” is generated containing a list of all detected humans. 5.3 Image processing Although different AI models are utilized throughout the application, it is impor- tant to remember that a model will always be limited by the image it receives. As the project only explores openly available pre-trained models, many of the models require the image to be of a different format than what it was originally captured in. As such, images will have to be resized in order to match the input format of the model. Consequently, the results given by the AI models will differ depending on how these transformations are made. OpenCV offers a resize function that is used throughout the application. This function takes an argument that decides which interpolation method to use. If no 20 5. Implementation such argument was given, it defaults to bilinear interpolation, which is the one used in this project. While these algorithms may do a good job of resizing the image, an issue arises when the original image has an aspect ratio different from the ratio of the model. For example, say that the input format of the AI model is 300x300, and the original image is 600x400. Utilizing the OpenCV resize function at this point to transform the 600x400 image to a 300x300 would lead to distortion. This will heavily affect the performance of the models as every person in the image is subsequently distorted. For example, the face detection models will have a harder time recognizing faces with unusual proportions as they have not been trained on distorted images. In order to resolve this issue, a resize function was implemented that would preserve the ratio of the given image. The idea is that, while maintaining the proportions of the original image, the method resizes it to its largest possible size without exceed- ing any of the boundaries. In the above-given example, this would mean resizing the 600x400 image to a 300x200. The function then pushes either rows or columns of black pixels to the end of the picture in order to ensure that it matches the input dimensions of the given model. An example of what this may look like can be seen in Fig. 8. (a) Original image (b) Resized without aspect ratio preserved (c) Resized with aspect ratio preserved Fig. 8. Resizing a 1600x896 image to 300x300 with and without the aspect ratio being preserved. 21 5. Implementation This solution has two major downsides. Firstly, the image is made smaller than it otherwise would have been which makes detection more difficult. Secondly, the network will analyze the black pixels which result in wasted processing time. A favorable solution would most likely be to find a model (or adjust the current one) to match the aspect ratio of the video recordings. However, as the purpose of this thesis has not been to find or build the most suitable models, this was considered to be outside of the scope. 5.4 Feature Implementation The following sections will discuss how the different features are implemented. The different ways the evaluators from Section 5.2 are utilized will be covered. 5.4.1 Face Blur OpenCV offers functionality for the blurring of images using different algorithms. As such, by using the face detection evaluator to get the position of faces in an image, the blurring of faces becomes possible. This result can be seen in Fig. 9. However, different alternatives have been explored to improve the face blurring quality. 
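To illustrate the core detect-and-blur step before going into those alternatives, the sketch below runs a frame through OpenCV's DNN module with a ResNet-10 SSD face detector and applies a Gaussian blur to every detection above the 40% confidence threshold used in the tests. The model and input file names are assumptions (the model files are the ones distributed with the OpenCV samples), and the aspect-ratio-preserving resize from Section 5.3 is omitted for brevity, so blobFromImage performs a plain resize.

import org.opencv.core.*;
import org.opencv.dnn.Dnn;
import org.opencv.dnn.Net;
import org.opencv.imgproc.Imgproc;
import org.opencv.videoio.VideoCapture;

public class FaceBlurSketch {
    // Confidence threshold used to filter out weak detections (40% in the tests).
    private static final double CONF_THRESHOLD = 0.4;

    static void blurFaces(Mat frame, Net faceNet) {
        // The ResNet-10 SSD face detector expects 300x300 BGR input with the
        // mean values below subtracted (values from the OpenCV sample config).
        Mat blob = Dnn.blobFromImage(frame, 1.0, new Size(300, 300),
                new Scalar(104.0, 177.0, 123.0), false, false);
        faceNet.setInput(blob);
        Mat detections = faceNet.forward();

        // The output tensor has shape [1, 1, N, 7]; flatten it to N rows of
        // [imageId, classId, confidence, left, top, right, bottom].
        Mat det = detections.reshape(1, (int) (detections.total() / 7));
        for (int i = 0; i < det.rows(); i++) {
            double confidence = det.get(i, 2)[0];
            if (confidence < CONF_THRESHOLD) continue;

            // Coordinates are normalized; scale them back to the frame size
            // and clamp them so the region stays inside the image.
            int left   = clamp((int) (det.get(i, 3)[0] * frame.cols()), 0, frame.cols() - 1);
            int top    = clamp((int) (det.get(i, 4)[0] * frame.rows()), 0, frame.rows() - 1);
            int right  = clamp((int) (det.get(i, 5)[0] * frame.cols()), 0, frame.cols() - 1);
            int bottom = clamp((int) (det.get(i, 6)[0] * frame.rows()), 0, frame.rows() - 1);
            if (right <= left || bottom <= top) continue;

            // Blurring the submat in place also modifies the parent frame.
            Mat faceRoi = frame.submat(new Rect(left, top, right - left, bottom - top));
            Imgproc.GaussianBlur(faceRoi, faceRoi, new Size(51, 51), 0);
        }
    }

    private static int clamp(int v, int lo, int hi) {
        return Math.max(lo, Math.min(hi, v));
    }

    public static void main(String[] args) {
        System.loadLibrary(Core.NATIVE_LIBRARY_NAME);
        Net net = Dnn.readNetFromCaffe("deploy.prototxt",
                "res10_300x300_ssd_iter_140000.caffemodel");
        VideoCapture capture = new VideoCapture("input.mp4"); // hypothetical input path
        Mat frame = new Mat();
        while (capture.read(frame)) {
            blurFaces(frame, net);
            // The processed frame would be handed to a VideoWriter in the real pipeline.
        }
        capture.release();
    }
}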
Two optimization algorithms have been implemented in an attempt at blurring faces not detected by the neural network. Fig. 9. Image modified by the implemented face blur feature. Source: Adapted from [42] 22 5. Implementation The first of the algorithms use body parts detected by the pose detection evaluator to estimate the location of a face. These body parts are namely the ears, eyes, nose, and neck. Whenever one of these body parts is detected at a place where the face detection found no result, a new detection is added. However, as of right now, the algorithm only uses one key point as a reference. The exact location and size of the face blur are therefore estimated. The second attempt at improving face detection is by using an algorithm to reduce flickering. When viewing the generated videos it was noted that, even though a face has hardly moved, the blur sometimes disappeared for a few frames only to return shortly. An example of this can be seen in Fig. 10. The idea was to add detections for the frames that were missed by utilizing the buffer architecture. The algorithm identifies if the first and last frames contain any similar detections. The comparison is performed by checking if the first frame has face detections that overlap with detections of the last frame. If this is the case, an assumption that the detection should also be present in the frames in between is made. The application then loops through all the frames in between, adding the detection from the first frame to each one. The results of these improvements will be presented in Section 6.1. Fig. 10. Four consecutive frames of a video processed by the application. As can be seen, the third frame should have a blur at approximately the same location as the other frames. 5.4.2 Tool Highlighting The tool highlighting feature was implemented using the object detection evaluator. Since the evaluator stores information regarding the position as well as the title of objects, highlighting the objects becomes a rudimentary task. Using this information the application draws a rectangle around each detected object in the image. Each drawn rectangle also has a label containing the name of the object placed next to it. An example of this output format can be seen in Fig. 11. The cars are used due to a lack of relevant data sets which will be discussed in Section 6.2. 23 5. Implementation Fig. 11. Image modified by the implemented tool highlighting feature. Source: Adapted from [43] 5.4.3 Ergonomic Risk Detection The ergonomic risk detection feature is implemented using the pose detection eval- uator. An ergonomic assessment is performed on each detected human in a frame. This assessment compares the coordinates of different key points to see if the hu- man is currently in a potentially harmful pose. The definition of harmful poses, as provided by Solme, are as follows: • An arm is kept at or above shoulder level – The wrist is above the shoulder – The elbow is above the shoulder • An arm is stretched out from the body – There is a 45◦ angle between the elbow and shoulder, indicating a stretch away from the body • An arm is kept at or below knee level – The wrist is below the knee – The elbow is below the knee If one of these poses is detected, a boolean indicating the pose will be set to true. An example of a frame where such a pose is detected can be seen in Fig. 12. 24 5. Implementation Fig. 12. Example frame where the wrist and elbow are detected above shoulder level. 
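As an illustration of how such an assessment could be expressed as plain geometry on the 2D key points, the sketch below encodes the three rules above as static checks. It assumes image coordinates where y grows downward, and it reads the 45-degree rule as the upper arm deviating at least 45 degrees from hanging straight down; the class and method names are illustrative, not the project's actual code.

import org.opencv.core.Point;

public final class ErgonomicRules {

    // In image coordinates the y axis grows downward, so "above" means a smaller y.
    static boolean armAtOrAboveShoulder(Point wrist, Point elbow, Point shoulder) {
        return wrist.y <= shoulder.y || elbow.y <= shoulder.y;
    }

    static boolean armAtOrBelowKnee(Point wrist, Point elbow, Point knee) {
        return wrist.y >= knee.y || elbow.y >= knee.y;
    }

    // One possible reading of the "45 degree angle between the elbow and shoulder"
    // rule: measure how far the upper arm deviates from hanging straight down.
    static boolean armStretchedOut(Point shoulder, Point elbow) {
        double dx = elbow.x - shoulder.x;
        double dy = elbow.y - shoulder.y;   // positive when the elbow is below the shoulder
        double angleFromVertical = Math.toDegrees(Math.atan2(Math.abs(dx), dy));
        return angleFromVertical >= 45.0;
    }

    // Combines the individual checks for one arm into a single risk flag,
    // mirroring the booleans mentioned in the text.
    static boolean armAtRisk(Point wrist, Point elbow, Point shoulder, Point knee) {
        return armAtOrAboveShoulder(wrist, elbow, shoulder)
                || armAtOrBelowKnee(wrist, elbow, knee)
                || armStretchedOut(shoulder, elbow);
    }
}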
The application checks the pose detection result buffer to see if it contains any of the non-ergonomic motions. If any such result is found, the program will check if it has not recently created timestamp for the same motion. The reasoning behind this is to avoid cluttering the output timestamp file. When the application detects a new non-ergonomic motion, it creates a timestamp with the name of that motion, such as ”left elbow below knee”. The program will then check the current frame and use the frame rate of the input video to generate the current time. The timestamp will then be added to a JSON object. At the end of execution, the JSON object is written to an output JSON file. An example of such a timestamp file can be seen in Fig. 13. { "Right arm stretch ":[5.5,16.1], "Left wrist below knee":[11.8], "Left arm stretch ":[3.6,14.2] } Fig. 13. Example of an output timestamp file. Each non-ergonomic motion has its own array of timestamps representing moments where the motion occurs in the video. The timestamps are given in seconds. 25 5. Implementation The JSON file is as of right now generated to demonstrate what the output format of this feature could look like. Solme could for example in the future be able to use this output to create timestamps in the AviX media player, allowing users to jump between the non-ergonomic motions. 5.4.4 Footstep Counting A simple version of footstep counting has also been implemented using the infor- mation provided by the pose detection evaluator. It operates under the assumption that there is exactly one person in the given video and that this person appears in every frame. To get the context of movement, two consecutive frames are compared against each other. The step counter is incremented if an ankle is rising in the second frame but not the first. To check if an ankle is rising, the application compares its y-coordinates in the different images. The application looks at one leg at a time. After counting a step taken with the right leg, it awaits a step taken with the left. As of right now, the number of steps taken is displayed in the corner of the output video. An example of this can be seen in Fig. 14. This output format could be changed to fit Solme’s needs in the future. Fig. 14. An example of the current output format of the step counting feature. Source: Adapted from [44] 26 6 Results and Discussion The resulting prototype is a modular application that is adaptable. The modularity results from the possibility to add or remove features without having to restructure the program, since the features are not necessarily dependent on each other. How- ever, there is the possibility to make the evaluators interact with each other by using the buffers. For example, as mentioned in Section 5.4.1, a face blur optimization utilizes body parts found by the pose estimation to make conclusions about the presence of a face. Adaptability is also a characteristic of the program since it is possible to switch between different DNN models depending on the use case. All of the tests were completed using JDK 15 (Java Development Kit), OpenCV 4.3.0, TensorFlow 1.15.0, and OpenPose version 1.7.0. Further details on the face blur test setup will be presented in its section. All feature results will be presented in their corresponding sections as well as a discussion regarding performance, usability, and limitations. 6.1 Face Blur The face blur test results consist of two neural network models tested on two videos. 
Both videos contain footage previously filmed for the AviX platform and are therefore relevant test material. Furthermore, the videos were recorded under good and poor conditions, as described in Section 4.3. Good and poor conditions in the videos should not be confused with the properties of the video files themselves. As the videos are not allowed for public use, example images will not be included.

Firstly, the specifications of the computer that ran the tests are presented. Secondly, the properties of the test videos are provided in Table I. Thirdly, the results are presented in Table II. Fourthly, a discussion interpreting the results is held.

The computer system was equipped with an Intel i7-9700KF 8-core CPU @ 3.6 GHz, 16 GB of RAM, and was running Windows 10. The confidence threshold was set to 40% in all tests. In the provided tables, 'Sensitivity' refers to the proportion of blurred faces to all faces according to

sensitivity = TruePositives / (TruePositives + FalseNegatives).

FPS and FP stand for frames per second and false positives, respectively. The entries labeled Optimized refer to the attempted optimizations presented in Section 5.4.1.

Table I
Video properties of the good and poor conditions test videos

Property     Good      Poor
Resolution   640x480   640x480
FPS          30        30
Frames       288       300
Faces        288       373

Table II
Results from the face blur tests

(a) Good conditions
Model                 Sensitivity [%]   FP   FPS
ResNet-10             100               3    16.0
ResNet-10 Optimized   100               3    11.3
S3FD                  100               0    1.3
S3FD Optimized        100               0    1.3

(b) Poor conditions
Model                 Sensitivity [%]   FP   FPS
ResNet-10             53                41   16.3
ResNet-10 Optimized   68                66   12.4
S3FD                  95                3    1.3
S3FD Optimized        95                32   1.4

As seen in Table II, the ResNet-10 model [37] is approximately ten times faster than the S3FD model [38]. The speed gained from using ResNet-10 needs to be weighed against its lower accuracy. The optimizations cause a significant speed decrease when using the ResNet-10 model, but have no impact on the S3FD model. This is because the number of instructions added by the optimization code is proportionally insignificant compared to the instructions executed by the S3FD model itself.

The speed of the models was measured by the execution time of the program. This will vary depending on the hardware that the program is run on, so the absolute values are not of interest. However, the relative speed is interesting as a metric of comparison since it should not be as hardware dependent.

Every face was detected in the video with good conditions. In this scenario, the ResNet-10 model is favorable to use since it saves processing time. However, in the video with poor conditions, the S3FD model had a significantly higher sensitivity, rendering it favorable.

The optimizations do not increase the number of true positives when the accuracy of the model is high, as is apparent from the good conditions test. However, the optimizations can increase the sensitivity when the accuracy of the model is low, as with the ResNet-10 model under poor conditions. False positives are more likely to be added by the optimizations when the conditions are poor.

There is the option to raise the detection threshold, which will decrease the number of false positives as well as the number of true positives. In the context of face blurring, a false negative is considered more detrimental than a false positive.
A false positive could at most lead to an obstructed view of the manufacturing process, while a false negative could lead to a person's face being identifiable. This means that it is acceptable to use a low detection threshold. Furthermore, it implies that the optimizations are desirable to use for the ResNet-10 model.

6.2 Tool Highlighting

Although an evaluator that supports object detection was made, it was not possible to find a model that supports relevant objects. Since the target videos are taken in a factory setting, the application requires an object detection model trained to detect the tools and machines used in these settings. Tool highlighting is, however, still possible to achieve by fitting the current evaluator to a model that is trained on relevant objects.

Furthermore, no data set containing relevant objects was found that could have been used to train a model. Available data sets where objects have already been annotated include ImageNet [45], Visual Genome [46], COCO [47], Google's Open Images [48], and more. The problem with these data sets is that they only include common everyday objects, which are of little interest for Solme's use case.

Creating a data set requires a large amount of work, since a multitude of pictures has to be annotated with bounding boxes around objects of interest. Furthermore, training a model on the finished data set would also prove time-consuming. Considering that Solme did not regard this feature as high priority, it was not pursued further.

6.3 Ergonomic Risk Detection

A video was produced under controlled conditions where the application was able to consistently detect the person going through all eight hazardous movements, subsequently generating accurate timestamps. A similar video was produced under dimmer lighting conditions where the poses were not as consistently detected. It can therefore be concluded that lighting is a factor in the accuracy of the pose estimation.

It was noted when setting up the controlled video conditions that a body part has to be fully visible in the video for OpenPose to detect it. The body parts this most frequently affects are the hands and feet. Since some of the hazardous movements can only be detected when the hands are detected, this is an issue. As seen in Fig. 15, once the hands were outside the frame, the joints in the arms could no longer be accurately estimated and consequently did not appear at all.

Fig. 15. Image where hands outside of the camera view render OpenPose unable to detect the arms.

Similar issues were present in the video representing realistic conditions for a typical AviX video. The lighting was less of a problem since factories are usually well lit. However, issues occurred due to the angles the workers were filmed from and the fact that they were covered by equipment or partly outside the frame.

In conclusion, the program detected all movements of interest provided that OpenPose had detected the pose accurately. The accuracy of the skeleton is therefore the factor with the most room for improvement, and those improvements can to a large extent be achieved with more careful filming.

6.4 Footstep Counting

After the initial research period during the first weeks of the project, it was deemed that footstep counting would take a lower priority among the features due to its complexity.
No DNN was found that directly supports the implementation of footstep counting. Instead, the solution is based on analyzing the body parts detected by the pose estimation. Solme deems that a footstep has occurred every time a foot touches down on the floor. By using the ankle coordinates generated by the pose estimation, footsteps could be counted under very strict conditions:

• Only one person can be in the frame.
• The person has to be standing on the floor at the very start.
• The person has to be close enough for the pose estimation to detect them, but far enough away for their lower body to be visible.
• The person should walk towards the camera at a constant distance from it, meaning that the camera has to move backwards so that the person remains the same size in the camera's view.
• The person's feet should only move in the camera view if they are walking, meaning that the camera should not pan up and down.

Under these strict conditions, the implementation of footstep counting does work. Unfortunately, the method of solely analyzing the pose estimation provides very poor results under realistic conditions. Additional footsteps are counted when the camera pans up and down. If a person leaves the frame and comes back, it is hard to establish that they are the same person. It is also hard to keep track of multiple people who stay on screen because the pose estimation misses detections. In conclusion, the data provided by the pose estimation is not enough to infer a proper footstep count.

7 Conclusions and Future Work

In this chapter, conclusions that can be drawn from this work are presented. Additionally, potential future work is suggested.

7.1 Conclusions

The final product has fulfilled the requirements set out as a proof-of-concept for Solme and has also highlighted key potentials and limitations. The goal was to construct a modular platform that makes it easy to change what type of analysis is run, which has been achieved. Moreover, the final version of the primary feature, face blurring, is deemed to perform well with 100% sensitivity and 0 false positives in 288 frames under good conditions. Under poor conditions, provided the right model, the feature performs slightly worse with 95% sensitivity and 32 false positives in 300 frames.

In terms of what has been identified from the proof-of-concept, the evaluation of the face blurring and ergonomic risk detection underlines the potential of these two features. Additionally, obstacles for the step counting and tool highlighting features have been identified and presented. The produced platform provides a strong basis for further automating manual tasks currently performed by Solme.

Furthermore, the evaluation also revealed limitations of using computer vision to automate tasks. An especially prominent limitation is the quality of the average video processed by AviX. Poorly lit spaces, low image quality, and shaky footage all reduce the models' ability to yield good results. Moreover, the lack of pre-trained models trained in a manufacturing setting is a drawback. Lastly, by only having access to a single camera angle, depth information is lost. This in turn limits the possibilities of both the step counting and ergonomic risk detection features.

7.2 Future Work

In terms of future work, a solid foundation has been created that Solme can choose to continue building on if they so please.
Additionally, further training of models for the face blurring and tool highlighting features would increase the performance of the end product. This could be accomplished by training the models on data sets containing workers wearing helmets and industry tools, respectively.

Moreover, if the software had access to depth information, a general standard ergonomic evaluation could be implemented rather than creating timestamps of potentially harmful positions for later evaluation. One such general standard is REBA [29]. In such an implementation, the feature could instead give the workstation an ergonomic score and advise employers on whether a workstation needs to be changed.

Lastly, another programming language could be used to implement a similar application. There are languages, such as Python, with better ML library support than Java. Furthermore, Python has a large number of online tutorials and resources for building computer vision projects.

Bibliography

[1] "Factory worker girl hard helmet indoors industrial industry," PxHere, Jun. 10, 2018. [Online]. Available: https://pxhere.com/en/photo/1558739 (visited on 05/05/2021).
[2] C. Behle, "The multiple benefits of video surveillance in manufacturing," AXIS Secure Insights, Jan. 4, 2021. [Online]. Available: https://www.axis.com/blog/secure-insights/video-surveillance-manufacturing/ (visited on 02/12/2021).
[3] "Solme AB," AviX, n.d. [Online]. Available: https://www.avix.se/om-oss (visited on 05/09/2021).
[4] "AviX suite," AviX, n.d. [Online]. Available: https://www.avix.se (visited on 05/09/2021).
[5] C. Lamarre, "What is line balancing and how to achieve it," Tulip, Aug. 14, 2019. [Online]. Available: https://tulip.co/blog/lean-manufacturing/what-is-line-balancing-and-how-to-achieve-it/ (visited on 05/09/2021).
[6] Juran, "Guide to failure mode and effect analysis – FMEA," Juran, Apr. 2, 2018. [Online]. Available: https://www.juran.com/blog/guide-to-failure-mode-and-effect-analysis-fmea/ (visited on 05/09/2021).
[7] A. Tezel, "Introduction to SMED: A neglected method in lean construction," Lean Construction Blog, Aug. 18, 2016. [Online]. Available: https://leanconstructionblog.com/Single-Minute-Exchange-of-Dies-A-Neglected-Method-in-Lean-Construction.html (visited on 05/09/2021).
[8] D. Harwell and E. Dou, "Huawei tested AI software that could recognize Uighur minorities and alert police, report says," The Washington Post, Dec. 8, 2020. [Online]. Available: https://www.washingtonpost.com/technology/2020/12/08/huawei-tested-ai-software-that-could-recognize-uighur-minorities-alert-police-report-says/ (visited on 04/23/2021).
[9] Ó. F. Civieta and J. Ravindran, "FBI warns of the rise of 'deepfakes' in coming months and explains how to spot them easily," Business Insider, Mar. 29, 2021. [Online]. Available: https://www.businessinsider.com/fbi-investigation-generated-computer-ai-artificial-intelligence-abuse-misinformation-porn-2021-3?r=US&IR=T (visited on 04/23/2021).
[10] R. Schmelzer, "Should we be afraid of AI?" Forbes, Oct. 31, 2019. [Online]. Available: https://www.forbes.com/sites/cognitiveworld/2019/10/31/should-we-be-afraid-of-ai/?sh=5638db7e4331 (visited on 04/27/2021).
[11] "The EU General Data Protection Regulation (GDPR) and face images," D-ID, Sep. 2018. [Online]. Available: https://www.deidentification.co/wp-content/uploads/2018/09/White-Paper-GDPR-and-D-ID.pdf (visited on 04/29/2021).
[12] K. M. Fincher and P. E. Tetlock, "Perceptual dehumanization of faces is activated by norm violations and facilitates norm enforcement," Journal of Experimental Psychology: General, vol. 145, no. 2, pp. 131–146, Feb. 2016. doi: 10.1037/xge0000132. [Online]. Available: https://doi.org/10.1037/xge0000132.
[13] "The advantages of ergonomics," Oregon OSHA, n.d. [Online]. Available: https://www.oshatrain.org/courses/pdf/ergoadvantages.pdf (visited on 03/28/2021).
[14] T. R. Waters and R. B. Dick, "Evidence of health risks associated with prolonged standing at work and intervention effectiveness," Rehabilitation Nursing, vol. 40, no. 3, pp. 148–165, May 2015. doi: 10.1002/rnj.166. [Online]. Available: https://doi.org/10.1002/rnj.166.
[15] "Computer vision," IBM, May 12, 2021. [Online]. Available: https://www.ibm.com/topics/computer-vision (visited on 04/27/2021).
[16] S. A. Papert, "The summer vision project," Artificial Intelligence Group, Jul. 1, 1966. [Online]. Available: https://dspace.mit.edu/handle/1721.1/6125 (visited on 04/22/2021).
[17] "Artificial intelligence (AI)," IBM, Jun. 3, 2020. [Online]. Available: https://www.ibm.com/cloud/learn/what-is-artificial-intelligence (visited on 04/11/2021).
[18] "What is AI?" Council of Europe, 2021. [Online]. Available: https://www.coe.int/en/web/artificial-intelligence/what-is-ai (visited on 04/12/2021).
[19] M. Mohri, A. Rostamizadeh, and A. Talwalkar, Foundations of Machine Learning, 2nd ed., ser. Adaptive Computation and Machine Learning. Cambridge, MA: MIT Press, 2018, 504 pp., ISBN: 978-0-262-03940-6.
[20] E. Hjelmås and B. K. Low, "Face detection: A survey," Computer Vision and Image Understanding, vol. 83, no. 3, pp. 236–274, Sep. 2001. doi: 10.1006/cviu.2001.0921. [Online]. Available: https://doi.org/10.1006/cviu.2001.0921.
[21] "AI vs. machine learning vs. deep learning vs. neural networks: What's the difference?" IBM, May 27, 2020. [Online]. Available: https://www.ibm.com/cloud/blog/ai-vs-machine-learning-vs-deep-learning-vs-neural-networks.
[22] S.-C. Wang, "Artificial neural network," in Interdisciplinary Computing in Java Programming, Springer US, 2003, pp. 81–100. doi: 10.1007/978-1-4615-0377-4_5. [Online]. Available: https://doi.org/10.1007/978-1-4615-0377-4_5.
[23] "Artificial neural network," Wikimedia, Feb. 22, 2011. [Online]. Available: https://commons.wikimedia.org/wiki/File:Artificial_neural_network.svg (visited on 05/14/2021).
[24] "AviX Ergo," AviX, n.d. [Online]. Available: https://www.avix.se/lean-production-verktyg/avix-ergo (visited on 04/27/2021).
[25] A. Devaux, N. Paparoditis, F. Precioso, and B. Cannelle, "Face blurring for privacy in street-level geoviewers combining face, body and skin detectors," Institut Géographique National, Jan. 2009. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.392.525&rep=rep1&type=pdf.
[26] A. Frome, G. Cheung, A. Abdulkader, M. Zennaro, B. Wu, A. Bissacco, H. Adam, H. Neven, and L. Vincent, "Large-scale privacy protection in Google Street View," in 2009 IEEE 12th International Conference on Computer Vision, IEEE, Sep. 2009. doi: 10.1109/iccv.2009.5459413. [Online]. Available: https://doi.org/10.1109/iccv.2009.5459413.
[27] K. Brkic, I. Sikiric, T. Hrkac, and Z. Kalafatic, "I know that person: Generative full body and face de-identification of people in images," in 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), IEEE, Jul. 2017. doi: 10.1109/cvprw.2017.173. [Online]. Available: https://doi.org/10.1109/cvprw.2017.173.
[28] R. Venkatesulu and S. Venkatesh Koundinya, "Human motion analysis based on human pose estimation," M.S. thesis, Chalmers University of Technology, SE-412 96 Gothenburg, 2020.
[29] M. Middlesworth, "A step-by-step guide to the REBA assessment tool," ErgoPlus, Oct. 17, 2020. [Online]. Available: https://ergo-plus.com/reba-assessment-tool-guide/ (visited on 04/10/2021).
[30] D. Kerr, J. Pengilley, and R. Garwood, "Assessment and visualisation of machine tool wear using computer vision," The International Journal of Advanced Manufacturing Technology, vol. 28, no. 7–8, pp. 781–791, May 2005. doi: 10.1007/s00170-004-2420-0. [Online]. Available: https://doi.org/10.1007/s00170-004-2420-0.
[31] E. C. Latorre, M. D. Zuniga, E. Arriaza, F. Moya, and C. Nikulin, "Automatic registration of footsteps in contact regions for reactive agility training in sports," Sensors, vol. 20, no. 6, p. 1709, Mar. 2020. doi: 10.3390/s20061709. [Online]. Available: https://doi.org/10.3390/s20061709.
[32] "Worker metal steel," Pixabay, Aug. 11, 2016. [Online]. Available: https://pixabay.com/photos/worker-metal-steel-manufacturing-4395768/ (visited on 05/05/2021).
[33] M. Singh, "How to blur faces in images using OpenCV in Python," TechGeekBuzz, n.d. [Online]. Available: https://www.techgeekbuzz.com/how-to-blur-faces-in-images-using-opencv-in-python/ (visited on 03/02/2021).
[34] "Detected-with-YOLO--Schreibtisch-mit-Objekten," Wikimedia, Jan. 14, 2019. [Online]. Available: https://commons.wikimedia.org/wiki/File:Detected-with-YOLO--Schreibtisch-mit-Objekten.jpg (visited on 05/05/2021).
[35] M. Blume, "Kansas City assembly," Wikimedia, Oct. 8, 2008. [Online]. Available: https://commons.wikimedia.org/wiki/File:Kansas_City_Assembly.png (visited on 05/05/2021).
[36] D. R. Bassett, L. P. Toth, S. R. LaMunion, and S. E. Crouter, "Step counting: A review of measurement considerations and health-related applications," Sports Medicine, vol. 47, no. 7, pp. 1303–1315, Dec. 2016. doi: 10.1007/s40279-016-0663-1. [Online]. Available: https://doi.org/10.1007/s40279-016-0663-1.
[37] V. Feng, "An overview of ResNet and its variants," Towards Data Science, Jul. 15, 2017. [Online]. Available: https://towardsdatascience.com/an-overview-of-resnet-and-its-variants-5281e2f56035 (visited on 05/05/2021).
[38] S. Zhang, X. Zhu, Z. Lei, H. Shi, X. Wang, and S. Z. Li, "S3FD: Single shot scale-invariant face detector," CoRR, vol. abs/1708.05237, Aug. 2017. arXiv: 1708.05237. [Online]. Available: http://arxiv.org/abs/1708.05237.
[39] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: Efficient convolutional neural networks for mobile vision applications," CoRR, vol. abs/1704.04861, Jun. 17, 2017. arXiv: 1704.04861. [Online]. Available: http://arxiv.org/abs/1704.04861.
[40] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C.-Y. Fu, and A. C. Berg, "SSD: Single shot multibox detector," CoRR, vol. abs/1512.02325, Dec. 8, 2015. arXiv: 1512.02325. [Online]. Available: http://arxiv.org/abs/1512.02325.
[41] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, and Y. Sheikh, "OpenPose: Realtime multi-person 2D pose estimation using part affinity fields," CoRR, vol. abs/1812.08008, 2018. arXiv: 1812.08008. [Online]. Available: http://arxiv.org/abs/1812.08008.
[42] "Industrial workers in uniform and safety equipment relaxing on a break drinking coffee and talking inside factory," Freepik, 2020. [Online]. Available: https://www.freepik.com/free-photo/industrial-workers-uniform-safety-equipment-relaxing-break-drinking-coffee-talking-inside-factory_11030706.htm (visited on 05/11/2021).
[43] M. Blume, "Audi S4 (20100325-DSC01393)," Wikimedia, Mar. 25, 2010. [Online]. Available: https://commons.wikimedia.org/wiki/File:Audi_S4_(20100325-DSC01393).jpg (visited on 05/11/2021).
[44] D. Larson, "Jimmy likes his treadmill," Vimeo, Dec. 30, 2015. [Online]. Available: https://vimeo.com/150376758 (visited on 05/11/2021).
[45] "An update to the ImageNet website and dataset," ImageNet, 2021. [Online]. Available: https://www.image-net.org/update-mar-11-2021.php (visited on 05/07/2021).
[46] "Visual Genome," Visual Genome, 2021. [Online]. Available: https://visualgenome.org/ (visited on 05/07/2021).
[47] "COCO dataset," COCO, 2021. [Online]. Available: https://cocodataset.org/#home (visited on 05/07/2021).
[48] "Open Images dataset," Google Open Source, 2021. [Online]. Available: https://opensource.google/projects/open-images-dataset (visited on 05/07/2021).
[49] "Intel," OpenCV, 2021. [Online]. Available: https://opencv.org/intel/ (visited on 04/29/2021).
[50] "About FFmpeg," FFmpeg, 2019. [Online]. Available: https://www.ffmpeg.org/about.html (visited on 04/29/2021).
[51] "Install TensorFlow Java," TensorFlow, 2021. [Online]. Available: https://www.tensorflow.org/install/lang_java (visited on 02/16/2021).

A Appendix 1

A.1 Library Description

A.1.1 OpenCV

OpenCV is a large open-source computer vision library and is as such not limited to any particular feature. It was originally developed by Intel in 1998 and became publicly available in 2000 [49]. Intel still funds the core development team and maintains the build farm of OpenCV [49]. OpenCV supports the Java programming language, although the majority of online resources and learning material for the library is written in Python or C++. As a video and image processing library, OpenCV also satisfies all of the project's video processing needs: it supports blurring, drawing pose estimation output, reading a video file, and exporting a video file.
OpenCV does not, however, support audio processing, so the exported video file will be missing its audio track. The audio can be added back to the output video file using, for example, ffmpeg [50].

A.1.2 TensorFlow

TensorFlow is an open-source platform for machine learning. By combining TensorFlow with OpenCV or ND4J, it can serve as a reliable way to create and deploy machine learning models. TensorFlow has a community hub on its website where hundreds of trained models can be downloaded and deployed for free [51]. TensorFlow was used in this project to deploy the pose estimation deep neural network.
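To make the role of OpenCV described above more concrete, the following is a minimal, simplified Java sketch of the kind of processing loop these libraries enable: frames are read from a video file, faces are detected with a pre-trained DNN, the detected regions are blurred, and the frames are written to a new file. The sketch is an illustration rather than the project's actual code; the file paths, model file names, and output codec are placeholder assumptions, while the 40% confidence threshold matches the one used in the face blur tests.

import org.opencv.core.*;
import org.opencv.dnn.Dnn;
import org.opencv.dnn.Net;
import org.opencv.imgproc.Imgproc;
import org.opencv.videoio.VideoCapture;
import org.opencv.videoio.VideoWriter;
import org.opencv.videoio.Videoio;

public class FaceBlurSketch {
    public static void main(String[] args) {
        System.loadLibrary(Core.NATIVE_LIBRARY_NAME); // load the OpenCV native bindings

        // Placeholder paths: an input video and a pre-trained Caffe face detection model.
        VideoCapture input = new VideoCapture("input.mp4");
        Net net = Dnn.readNetFromCaffe("deploy.prototxt", "res10_300x300_ssd.caffemodel");

        double fps = input.get(Videoio.CAP_PROP_FPS);
        Size frameSize = new Size(input.get(Videoio.CAP_PROP_FRAME_WIDTH),
                                  input.get(Videoio.CAP_PROP_FRAME_HEIGHT));
        VideoWriter output = new VideoWriter("blurred.avi",
                VideoWriter.fourcc('M', 'J', 'P', 'G'), fps, frameSize);

        Mat frame = new Mat();
        while (input.read(frame)) {
            // Convert the frame to the 300x300 input blob expected by the detector.
            Mat blob = Dnn.blobFromImage(frame, 1.0, new Size(300, 300),
                    new Scalar(104, 177, 123), false, false);
            net.setInput(blob);

            // The network output is a 1x1xNx7 blob; reshape it to N rows of 7 values:
            // [batchId, classId, confidence, x1, y1, x2, y2], with normalized coordinates.
            Mat detections = net.forward();
            Mat dets = detections.reshape(1, (int) (detections.total() / 7));

            for (int i = 0; i < dets.rows(); i++) {
                double confidence = dets.get(i, 2)[0];
                if (confidence < 0.4) continue; // 40% confidence threshold

                // Scale the normalized box to pixel coordinates and clamp it to the frame.
                int x1 = (int) (dets.get(i, 3)[0] * frame.cols());
                int y1 = (int) (dets.get(i, 4)[0] * frame.rows());
                int x2 = (int) (dets.get(i, 5)[0] * frame.cols());
                int y2 = (int) (dets.get(i, 6)[0] * frame.rows());
                Rect box = new Rect(new Point(Math.max(x1, 0), Math.max(y1, 0)),
                                    new Point(Math.min(x2, frame.cols()), Math.min(y2, frame.rows())));
                if (box.width <= 0 || box.height <= 0) continue;

                // Blur the detected face region in place.
                Mat face = frame.submat(box);
                Imgproc.GaussianBlur(face, face, new Size(55, 55), 0);
            }
            output.write(frame);
        }
        input.release();
        output.release();
    }
}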