Optical Load Detection
Load Weighing for Construction Machines using Stereo Vision
and Convolutional Neural Networks

Master’s thesis in Systems, Control and Mechatronics

DANIEL STRÅHLE
KEVIN WINGÅRD OLSSON

DEPARTMENT OF MECHANICS AND MARITIME SCIENCES

CHALMERS UNIVERSITY OF TECHNOLOGY
Gothenburg, Sweden 2022
www.chalmers.se



Master’s thesis 2022:21

Optical Load Detection

Load Weighing for Construction Machines using Stereo Vision and
Convolutional Neural Networks

DANIEL STRÅHLE
KEVIN WINGÅRD OLSSON

Department of Mechanics and Maritime Sciences
Division of Vehicle Engineering and Autonomous Systems

Chalmers University of Technology
Gothenburg, Sweden 2022


Optical Load Detection
Load Weighing for Construction Machines using Stereo Vision
and Convolutional Neural Networks
Daniel Stråhle
Kevin Wingård Olsson

© Daniel Stråhle and Kevin Wingård Olsson, 2022.

Supervisor: Mathias Andreasson, CPAC Systems AB
Examiner: Peter Forsberg, Department of Mechanics and Maritime Sciences

Master’s Thesis 2022:21
Department of Mechanics and Maritime Sciences
Division of Vehicle Engineering and Autonomous Systems
Chalmers University of Technology
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Cover: Demonstration of depth measurement using the stereo camera mounted on
an excavator.

Typeset in LaTeX, template by Kyriaki Antoniadou-Plytaria
Printed by Chalmers Reproservice
Gothenburg, Sweden 2022



Optical Load Detection
Load Weighing for Construction Machines using Stereo Vision and
Convolutional Neural Networks
Daniel Stråhle
Kevin Wingård Olsson
Department of Vehicle Engineering and Autonomous Systems
Chalmers University of Technology

Abstract
Accurate excavation monitoring is important for the handling of materials within
the construction industry. Modern construction machines provide built-in systems
for weighing handled goods. In this thesis, an alternative optical weighing system
is developed and implemented for an excavator and a wheel loader. The optical
system detects and provides the volume and weight of the handled material through
fill-factor estimation. The methodology is based on depth data and images captured
by a stereo camera, mounted on the machines. By using a region-based convolu-
tional neural network (CNN), localization of material and fill-factor estimation are
managed jointly. Material classification is also shown to be possible using gathered
images and a simple CNN. By combining the fill-factor and information about the
material, weight is obtained. Evaluations reveal that the system measures fill-factor
with mean absolute percentage errors (MAPE), relative to the maximum capacity of
the excavator and the wheel loader, of 3.3 % and 3.0 % respectively.

Keywords: Excavation Monitoring, CNN, Faster R-CNN, RPN, Range Sensor, Stereo
Camera, Computer Vision, Material Classification.



Acknowledgements
This project would not have been possible without all the external support we have
received. First of all, we want to thank our great supervisor Mathias Andreasson, who
introduced us to the project, the office, and CPAC Systems AB. His joyful ambition
and knowledge were a great source of inspiration for advancing the project. Many
thanks go to our examiner Peter Forsberg, who contributed to the project with
insightful discussions and advice. Furthermore, the achieved results would not have
been possible without Niklas Sjöstedt, who supported us with knowledge regarding
the machines and operated them during the data acquisitions. Additional thanks
to Marcus Carlsson for operating the wheel loader during one of the acquisition
campaigns. Lastly, we extend our gratitude to everyone at CPAC Systems AB for
welcoming us to the office and everything around it, as well as for supporting us
with the project.

Daniel Stråhle and Kevin Wingård Olsson, Gothenburg, June 2022

Thesis advisor: Mathias Andreasson, CPAC Systems AB
Thesis examiner: Peter Forsberg, Department of Mechanics and Maritime Sciences



List of Acronyms

API application programming interface

CNN convolutional neural network

FPS frames per second

IMU inertial measurement unit
IOU intersection over union

LIDAR light detection and ranging

MAE mean absolute error
MAPE mean absolute percentage error

R-CNN region-based convolutional neural network
RADAR radio detection and ranging
RGB red, green and blue
ROI region of interest
RPN region proposal network

STD standard deviation
SVM support-vector machine

TOF time-of-flight



Contents

List of Acronyms
List of Figures
List of Tables

1 Introduction
1.1 Background
1.2 Related Work
1.3 Purpose and Goals
1.4 Limitations

2 Theory
2.1 Range Sensors
2.2 Stereo Vision
2.2.1 3D Perception
2.2.2 Binocular Vision
2.2.3 Stereo Matching Methods
2.2.4 3D Reconstruction and Data Representation
2.3 Convolutional Neural Networks
2.4 The R-CNN Framework
2.4.1 Versions of R-CNN
2.4.2 Region Proposal Network
2.4.3 ROI-Pooling
2.4.4 Evaluation and Loss Functions
2.4.5 Summary of Faster R-CNN

3 Methods
3.1 Hardware
3.1.1 Choice of Equipment
3.1.2 Camera Settings and Data Acquisition Pipeline
3.2 Depth Image Generation
3.3 System Architecture
3.3.1 Fill-Factor Estimation
3.3.2 Material Classification
3.4 Data Acquisition
3.4.1 Initial Investigation
3.4.2 Excavator
3.4.3 Wheel Loader
3.4.4 Material Collection
3.5 Training
3.6 Evaluation

4 Results
4.1 Fill-Factor and Weight Estimations
4.2 Material Classification

5 Discussion
5.1 Fill-Factor Estimations
5.2 Material Classification
5.3 Hardware
5.4 The Complete Solution
5.5 System Improvements and Future Work

6 Conclusion

Bibliography

A Material Images, Densities and Full Confusion Matrix


List of Figures

2.1 Triangulation scheme of stereo vision with relevant geometry for estimating
the distance, z, to a point P with world coordinates (x, y, z). Each image plane
has its own coordinate system.

2.2 Simple CNN architecture with the three main components: convolutional-,
pooling- and fully-connected layers. The convolutional layer slides a filter
across the input and the pooling layer performs downsampling. Lastly, the
fully-connected layer is used for evaluating the produced feature map for
generating predictions.

2.3 Overview of structure and components of the three versions within the R-CNN
family. The main difference is the handling of region proposals. The R-CNN and
the Fast R-CNN use the fixed selective search to generate proposals, compared to
the Faster R-CNN which uses a region proposal network (RPN). Furthermore,
depending on the version, a special region of interest (ROI) pooling layer is
required before evaluation. Finally, evaluation is done through support-vector
machines (SVMs) or fully-connected layers.

2.4 Box scales and aspect ratios are combined to generate anchor boxes of various
shapes and sizes, as depicted to the right. These boxes are generated for each
anchor (black dot) in the grid.

2.5 The intersection over union (IOU) is the intersecting area of the boxes
divided by the union area of the boxes.

2.6 2x2 region of interest (ROI) pooling. The input is divided in a 2x2 grid and
the maximum of each area, marked with red squares, is the output value.

2.7 All components and connections to build the Faster R-CNN structure.

3.1 The color map Jet, used for colorizing depth data [29]. Low distance values
are mapped to blue, while higher values tend towards red.

3.2 Examples of generated depth images and comparison of filtering interval. The
image to the left is the regular RGB image displaying the scene, the middle image
is a produced depth image with a large distance interval and the right image is a
generated depth image with an adapted and tighter distance interval.

3.3 An overview of the load detection system describing the flow and processing
of the data. By utilizing a depth and RGB image, the weight of loaded material can
be calculated. The system uses CNN to determine both the fill-factor and material
type. Combining this information with a predetermined reference volume and density
results in a weight prediction.

3.4 Weight estimation using fill-factor, reference volume, and material
prediction. The reference volume is external information specified on the bucket,
the fill-factor is estimated through the Faster R-CNN and the material density is
extracted from a database based on the prediction from the material classifier.

3.5 Initial static camera setup for feasibility evaluation. Left image depicts a
side view of the setup, the middle image the field of view captured from the RGB
camera and the image to the right the produced depth image.

3.6 Intel RealSense Depth Camera D435i mounted on the excavator, used for
collecting data.

3.7 Sample images captured by the stereo camera mounted on the excavator. The left
is the RGB image and the right the generated depth image.

3.8 Intel RealSense Depth Camera D435i mounted on the wheel loader, used for
collecting data.

3.9 Sample images captured by the stereo camera mounted on the wheel loader. The
left is the RGB image and the right the generated depth image.

3.10 Examples of piles used for acquiring RGB images used for training the
material classification network.

3.11 An example confusion matrix. Material 1 was predicted correctly, while
material 2 and 3 had some misclassifications.

4.1 Two samples from the static campaign with ground truth (GT) and predictions
(est.). Ground truth bounding boxes are drawn in white and predicted bounding
boxes in yellow. The estimated fill-factors and the resulting weight estimations
are presented together with the sample error (diff.).

4.2 Two samples from the excavator campaign with ground truth (GT) and predictions
(est.). Ground truth bounding boxes are drawn in white and predicted bounding
boxes in yellow. The estimated fill-factors and the resulting weight estimations
are presented together with the sample error (diff.).

4.3 Two samples from the wheel loader campaign with ground truth (GT) and
predictions (est.). Ground truth bounding boxes are drawn in white and predicted
bounding boxes in yellow. The estimated fill-factors and the resulting weight
estimations are presented together with the sample error (diff.).

4.4 Sample and absolute relative errors with respect to loaded (true) weights for
the static campaign.

4.5 Sample and absolute relative errors with respect to loaded (true) weights for
the excavator campaign.

4.6 Sample and absolute relative errors with respect to loaded (true) weights for
the wheel loader campaign.

4.7 Confusion matrix with seven classes. Correct predictions lie on the diagonal.

A.1 The materials used for training and evaluating the material classification
network. The input to the network is extracted patches from the center of the
images.

A.2 Gravel type 8 0-32 mm, used for training and evaluating the fill-factor
network.

A.3 Confusion matrix including all individual material types. Class names indicate
the type of material and the fineness. For instance "Gravel Type 1 0-16" refers to
one type of gravel in the dataset with grain sizes in the interval 0 mm to 16 mm.


List of Tables

2.1 Corresponding labels for IOU values.

3.1 Distance intervals used for filtering depth data.
3.2 Sizes of datasets used and hyperparameters used for training Faster R-CNN.
3.3 Dataset sizes and number of classes used in the material classification
network.

4.1 The performance of the fill-factor and weight estimations, evaluated through
presented metrics.
4.2 Sum of sample errors for each campaign together with the sum of the loaded
weights and relative error.

A.1 Material categories and approximate density intervals (tons per cubic meter).


1
Introduction

In this chapter, the background and related work encompassing optical load detection
are covered. Furthermore, the objective and scope of the project are presented.

1.1 Background
Excavation and material transportation are major activities within the construction
industry. In large-scale construction, there is a strong drive to improve safety and
productivity in order to reduce costs, labor and environmental impact. Hence, there
is a desire to increase the efficiency of material loading and transportation. This
can be accomplished through high-accuracy sensing and excavation progress
monitoring. For proper handling and transportation of materials, it is important to
know the type and the weight of the goods. The load weighing systems available
for excavators and wheel loaders today are mostly built-in systems that measure
axis loads. However, this method has a cumbersome setup. Hence, alternative or
complementary methods are of interest. One idea is an optical load detection system
that automatically detects the type, volume and weight of the goods. So far, few
attempts have been made at optical weight estimation for construction machines.
The development of such a system has the potential to open up many new
possibilities. For example, retrofitting machines with the internal sensors used by
current methods might not be possible. In such a case, an external optical system
may be the sole option.

1.2 Related Work
An optical system commonly involves object detection and classification, which are
both common tasks within computer vision and machine learning domains. 3D
reconstruction from 2D images and the concept of estimating the volume of objects
using computer vision are not new either. There have been attempts at volume
estimation using optical sensors for various applications. To name a few, Rundgren
uses a multi-view system for 3D reconstruction and volume estimation of timber
loads [1]. Another work by Artaso and López-Nicolás uses a time-of-flight (TOF)
camera and two structured light cameras for measuring the volume of merchandise
in a logistics application [2]. Another example is dietary assessment, where the
number of calories is estimated from food images by classifying the food and
estimating its volume [3].

More related to construction and excavation monitoring, previous work by Lu et al.
uses a neural network-based approach for fill-factor estimation and bucket detection
on construction machines [4]. Fill-factor refers to the ratio of the loaded material’s
volume with respect to the maximal load capacity of the bucket or container. Ac-
cording to the authors, bucket fill-factor estimation remains one of the key challenges
in the automation of construction machines. Lu et al. identify some of the diffi-
culties in estimating the volume of handled material. A particular difficulty is that
on-board weighing systems are unable to estimate loaded volume without external
information about the density of the material. However, an optical solution would
be able to solve the issue.
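
To make the relation between fill-factor, volume and weight concrete, the following
minimal Python sketch multiplies a fill-factor by a reference bucket volume and a
material density. The function name, the bucket volume of 1.5 m³ and the density of
1.6 t/m³ are hypothetical values chosen for illustration; they are not taken from
the cited work or from the machines used in this thesis.

```python
def estimate_volume_and_weight(fill_factor, reference_volume_m3, density_t_per_m3):
    """Return (volume in m^3, weight in tonnes) for a loaded bucket.

    fill_factor is the ratio of loaded volume to the bucket's reference volume.
    """
    if fill_factor < 0:
        raise ValueError("fill-factor must be non-negative")
    volume_m3 = fill_factor * reference_volume_m3
    weight_t = volume_m3 * density_t_per_m3
    return volume_m3, weight_t


# Hypothetical example values (assumptions, not measured data).
volume, weight = estimate_volume_and_weight(
    fill_factor=0.8, reference_volume_m3=1.5, density_t_per_m3=1.6
)
print(f"volume = {volume:.2f} m^3, weight = {weight:.2f} t")
```

With these assumed values, a fill-factor of 0.8 corresponds to a volume of about
1.2 m³ and a weight of about 1.92 t.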

The method proposed by Lu et al. comprises three stages: pre-processing, machine
learning and post-processing [4]. In the pre-processing stage, depth images are
generated from data captured by a stereo camera. The depth images are used for
predicting bucket fill-factor through machine learning. Lastly, the predictions go
through post-processing based on a probabilistic approach.

Another previous work is conducted by Rasul et al. [5]. They use integrative
methodologies for effective excavation progress monitoring. The work involves two
aspects: volume estimation and 5D mapping. For volume estimation, two meth-
ods are used: direct estimation and indirect estimation. For the direct estimation,
the excavated ground volume is estimated by comparing a 3D measurement of the
ground with a reference measurement. Both measurements are captured by a stereo
camera. An issue with the direct estimation is the risk of an occlusion area in the
digging space, which hinders accurate ground detections. The indirect estimation
uses the bucket volume, given by a 3D model, as a reference together with a mea-
surement of a filled bucket. The difference between the 3D model and measurement
is the excavated volume.

The second aspect investigated by Rasul et al. is the idea of 5D mapping for ma-
terial classification [5]. It includes information on the excavated ground in terms
of geometric space and material properties using information from several sensors.
3D data obtained by the stereo camera is fused with intensity data, obtained by a
light detection and ranging (LIDAR) sensor, and ground resistive force, obtained
through pressure data of the excavator. By fusing information, more accurate ma-
terial classifications can be achieved.

1.3 Purpose and Goals

This project aims at improving excavation progress monitoring. A comprehensive
solution for detecting, classifying and estimating the volume and weight of loaded
materials using optical means is to be developed. By accurately monitoring the
loaded material, trucks can be filled optimally and the movement of materials can
be closely tracked.

The purpose of the project is to develop a load detection system. The system
should work both as a standalone system, and as a complement to existing weighing
and classification systems. In this project, the targeted construction machines are
excavators and wheel loaders.

The project consists of several parts, which can be summarized by the following
objectives:

• Choice and implementation of hardware

• Identification of the material container in sensor data

• Material volume estimation through fill-factor estimation

• Material classification

• Material weight estimation

The main objectives are the fill-factor estimation of the container and material
classification. For this application, the container corresponds to the buckets of
the construction machines. When investigating and working on the objectives, the
following questions are answered:

1. How can an optical system be implemented for volume and weight estimation
of handled material, for a construction machine?

2. What performance, in terms of mean absolute percentage error (MAPE), can
be achieved for fill-factor and weight estimation of construction material using
convolutional neural networks?

3. What performance, in terms of classification accuracy, can be achieved for
material classification using convolutional neural networks?
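
As a sketch of the two metrics named in the questions above, the Python functions
below compute a MAPE normalized by the maximum capacity and a classification
accuracy. Normalizing by the capacity follows the description in the abstract; the
function names and the sample values are hypothetical and only serve as an
illustration.

```python
def mape_relative_to_capacity(estimates, ground_truths, capacity):
    """Mean absolute error as a percentage of the maximum capacity."""
    if not estimates or len(estimates) != len(ground_truths):
        raise ValueError("inputs must be non-empty and of equal length")
    total_abs_error = sum(abs(e - t) for e, t in zip(estimates, ground_truths))
    return 100.0 * total_abs_error / (capacity * len(estimates))


def classification_accuracy(predicted_labels, true_labels):
    """Fraction of samples whose predicted label equals the true label."""
    if not true_labels or len(predicted_labels) != len(true_labels):
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return correct / len(true_labels)


# Hypothetical fill-factors, with capacity 1.0 meaning a completely full bucket.
mape = mape_relative_to_capacity([0.50, 0.90], [0.55, 0.80], capacity=1.0)
accuracy = classification_accuracy(
    ["gravel", "sand", "sand"], ["gravel", "sand", "gravel"]
)
print(f"MAPE = {mape:.1f} %, accuracy = {accuracy:.2f}")
```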

1.4 Limitations
The system developed in the project should be a comprehensive solution in terms of
fill-factor, volume and weight estimation of loaded materials, as well as material
classification. The focus is not to provide a scalable and generalized solution, but
rather to provide a basis containing all the components, which can be further
developed and improved. Hence, the scope of the project is limited as follows:

• Only one excavator model and one wheel loader model are considered.

• Only one bucket size and type is considered for each machine.

• Requirements are specified for the operator, such as how to level and orientate
the buckets when the data is gathered.

• There are no strict requirements on efficiency in terms of the computational
speed of the system.

• Data processing and predictions are performed offline.



2
Theory

This chapter presents the theory needed to understand the method. Since the method
relies on depth information, techniques for acquiring this information are
presented, with a focus on the binocular vision used in stereo cameras. The system
also utilizes machine learning, specifically convolutional neural networks and the
Faster R-CNN framework, which are covered last.

2.1 Range Sensors

Range sensors refer to sensors that capture 3D information from the sensor’s view-
point [6]. The depth is usually measured as the distance to the closest object(s),
either as a single point, scanning plane or as a whole image containing depth values
for each point. The acquired data is useful for many applications, such as robotics
and automation, where perception of the world is required for navigation or for
determining 3D properties of objects. There are plenty of sensors available today to
measure depth, some of which are presented here.

Several sensors revolve around two principles: active triangulation and time-of-flight
(TOF) [7]. Active triangulation generally refers to structured light depth cameras.
A projector is used to illuminate a surface with a known pattern, which is then
acquired by a camera. The acquired image contains a superimposed version of
the pattern. Depth can be obtained by comparing the patterns and performing
geometrical reconstruction. The performance of a structured light setup depends on
the choice of patterns, algorithms and the scene [8].

The other family of range sensors revolves around TOF. These sensors emit a signal
and receive its reflection some time later. Using the time between transmitting and
receiving the signal, together with its propagation speed, the distance to the
reflection point can be calculated [9]. This concept has long been used in radio
detection and ranging (RADAR) technology, which uses electromagnetic waves [7].
LIDAR is another TOF sensor, which uses laser light [8]. It usually has a large
field-of-view at the cost of providing sparse depth information [10].
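
The TOF principle can be sketched in a few lines of Python. The example assumes
that the measured time covers the full round trip, so the path length is halved,
and that the signal travels at the speed of light, as for LIDAR and RADAR. The
function name and the 20 ns sample value are hypothetical.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0  # propagation speed of light in vacuum


def tof_distance(round_trip_time_s, speed_m_per_s=SPEED_OF_LIGHT_M_PER_S):
    """Distance to the reflection point from a measured round-trip time."""
    if round_trip_time_s < 0:
        raise ValueError("round-trip time must be non-negative")
    # The signal travels to the target and back, so the path length is halved.
    return speed_m_per_s * round_trip_time_s / 2.0


# Hypothetical measurement: a reflection received 20 ns after emission.
distance_m = tof_distance(20e-9)
print(f"distance = {distance_m:.3f} m")
```

A round-trip time of 20 ns corresponds to a distance of roughly 3 m, which
indicates the timing resolution such sensors require.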


2.2 Stereo Vision
A range sensor based on binocular vision is the stereo camera. It utilizes two or more
cameras combined with triangulation for estimating depth. The following sections
introduce the theory behind computer vision and binocular vision.

2.2.1 3D Perception
Cameras map the 3D world to 2D images using projections [11]. A consequence
of projecting the world to a surface is the loss of one dimension, namely depth.
The projection is irreversible, which means that the depth dimension cannot be
recovered from a single 2D image. However, the knowledge of 3D data is essential
for applications dealing with robot vision, automatic navigation, automotive safety
and many others [8]. Hence, acquiring 3D properties from environments has been an
active subject of research for a long time. Humans and most animals use a binocular
visual system (two eyes) [12]. The key point of a binocular vision system is the
difference between the left and right images due to slightly different perspectives.
This mechanism inspired image-based 3D reconstruction using images captured from
multiple points of view.

2.2.2 Binocular Vision
Figure 2.1 depicts an overview of a binocular vision system, with essential axes and
distances to be able to calculate the distance to a point P , with world coordinates
(x, y, z). The distance between the sensor and a point in the scene is also known as
depth.

Figure 2.1: Triangulation scheme of stereo vision with relevant geometry for esti-
mating the distance, z, to a point P with world coordinates (x, y, z). Each image
plane has its own coordinate system.


In Figure 2.1, O1 and O2 represent the optical centres of the cameras. The distance
between the optical centres is called the baseline, denoted b. Point P is
projected onto the image planes of the left and right cameras at points P1 and
P2, with local image coordinates (X1, Y1) and (X2, Y2) respectively. The focal
length, denoted f, is the distance between the optical centre of a camera and its
image plane. The aim is to find the distance z to the point P through triangulation.
Two triangles are formed: PO1O2 and PP1P2. Since these triangles are similar, the
distance is described by Equation (2.1).

$$\frac{b}{z} = \frac{b - X_1 + X_2}{z - f} \tag{2.1}$$

Extracting the distance z from Equation (2.1) yields Equation (2.2).

$$z = \frac{bf}{X_1 - X_2} = \frac{bf}{d} \tag{2.2}$$
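
For reference, the intermediate algebra between Equations (2.1) and (2.2) can be
written out as follows. This is a sketch of the standard manipulation, using only
the symbols defined above.

```latex
\begin{align*}
\frac{b}{z} &= \frac{b - X_1 + X_2}{z - f} \\
b\,(z - f) &= z\,(b - X_1 + X_2) \\
bz - bf &= bz - z\,(X_1 - X_2) \\
z\,(X_1 - X_2) &= bf \\
z &= \frac{bf}{X_1 - X_2}
\end{align*}
```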

In Equation (2.2), d = X1 −X2 denotes the disparity. It represents the horizontal
difference between corresponding pixels in the left and right images. The disparity
is an essential component of the binocular vision system. Since the rest of the values
in Equation (2.2) are constant and usually known, the task becomes to obtain the
disparity for each pixel in the image pair to retrieve the distance z. The collection of
disparity values describing the whole scene is called a disparity map. The procedure
of obtaining the disparity map is a matter of matching corresponding pixels in each
image using various matching algorithms.
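
A minimal Python sketch of Equation (2.2) applied to a whole disparity map is given
below. It assumes that the disparity and the focal length are both expressed in
pixels and the baseline in meters, so that the depth is obtained in meters. The
camera parameters and the function names are hypothetical, and pixels without a
valid disparity are mapped to None.

```python
def depth_from_disparity(disparity_px, baseline_m, focal_length_px):
    """Apply z = b * f / d; return None where no valid disparity exists."""
    if disparity_px <= 0:
        return None  # no match found, or a point infinitely far away
    return baseline_m * focal_length_px / disparity_px


def depth_map_from_disparity_map(disparity_map, baseline_m, focal_length_px):
    """Convert a 2D list of disparities (pixels) to a 2D list of depths (meters)."""
    return [
        [depth_from_disparity(d, baseline_m, focal_length_px) for d in row]
        for row in disparity_map
    ]


# Hypothetical camera parameters, chosen only for illustration.
BASELINE_M = 0.05
FOCAL_LENGTH_PX = 640.0

disparities = [[32.0, 16.0], [8.0, 0.0]]
depths = depth_map_from_disparity_map(disparities, BASELINE_M, FOCAL_LENGTH_PX)
print(depths)
```

The example shows the inverse relation between disparity and depth: halving the
disparity doubles the estimated distance.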

2.2.3 Stereo Matching Methods
Stereo matching is a vital part of 3D reconstruction since it is the main step to re-
trieve depth information. However, stereo matching is challenging due to difficulties
such as noise, specular surfaces, ambiguous regions, repetitive patterns, transparent
objects and occlusions [13].

The methods for finding point correspondences can be broadly divided into two
categories: sparse and dense methods [12]. Sparse methods are usually feature-
based, meaning that sets of potential image locations are extracted and then matched
between the images. Nevertheless, sparse features can be difficult to recover in a 3D
scene. Dense methods are correlation-based and produce disparity estimates for all
image regions [14]. The dense stereo correspondence method thus produces a dense
disparity map with a disparity estimate for each pixel.

Scharstein and Szeliski propose a taxonomy for stereo algorithms in [14], where most
perform the following steps to solve the correspondence problem:

1. Matching cost computation
2. Cost (support) aggregation
3. Disparity computation/optimization
4. Disparity refinement


The procedure depends on the specific algorithm used. Matching cost computation
compares pixels in the left and right images using a cost function. A lower cost
means two pixels are more likely to be a matching pair. The cost function is usually
based on the difference in pixel intensities. However, the cost calculated from two
pixels might not have sufficient information to determine the match. Thus, cost
aggregation can be used to include information from nearby points. A frequent
method is to use a predefined window of some size for averaging or summing costs
from a region. The size of the window heavily affects the result and thus has to be
chosen wisely.

The disparity computation can be divided into local and global methods. Local
methods emphasize the matching cost computation and the cost aggregation steps.
The final disparity for each pixel is chosen as the one with the minimum cost
value [15]. However, the local approach is sensitive to image noise, occlusions and
blurred areas. On the other hand, the global approach uses a more intelligent disparity
decision strategy by including assumptions about the images. Such assumptions
can be that similar regions within object boundaries should have uniform disparity
distribution. The global approach incorporates smoothness in the disparity estima-
tion, which results in fewer errors caused by disparity discontinuities, occlusions and
texture-less areas. The last step of disparity refinement acts as a post-processing
step, where noise and uncertainties are removed and the disparity map is optimized
to be more accurate. For example, occluded areas do not contain corresponding
point matches since the occluded region is only visible within one of the images.
Hence, occlusion filling can be used for estimating the disparities in these areas
using adjacent values.
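
The first three steps can be illustrated with a minimal local block-matching sketch
in Python. It uses the sum of absolute differences as matching cost, a square
window for aggregation and a winner-takes-all choice of disparity; refinement is
omitted. This is one simple instance of a local method and not the algorithm used
by any particular camera. All names and the synthetic image pair are hypothetical.

```python
import random


def sad_cost(left, right, row, col, disparity, half_window):
    """Matching cost aggregated over a square window (steps 1 and 2)."""
    cost = 0
    for dr in range(-half_window, half_window + 1):
        for dc in range(-half_window, half_window + 1):
            left_value = left[row + dr][col + dc]
            # A left pixel at column c is sought at column c - d in the right image.
            right_value = right[row + dr][col + dc - disparity]
            cost += abs(left_value - right_value)
    return cost


def block_matching(left, right, max_disparity, half_window=1):
    """Winner-takes-all disparity per pixel (step 3); borders are left at 0."""
    height, width = len(left), len(left[0])
    disparity_map = [[0] * width for _ in range(height)]
    for row in range(half_window, height - half_window):
        for col in range(half_window + max_disparity, width - half_window):
            costs = [
                sad_cost(left, right, row, col, d, half_window)
                for d in range(max_disparity + 1)
            ]
            disparity_map[row][col] = costs.index(min(costs))
    return disparity_map


# Synthetic pair: the left image is the right image shifted by two pixels.
rng = random.Random(0)
right_image = [[rng.randrange(256) for _ in range(16)] for _ in range(8)]
SHIFT = 2
left_image = [
    [row[c - SHIFT] if c >= SHIFT else 0 for c in range(len(row))]
    for row in right_image
]

result = block_matching(left_image, right_image, max_disparity=3)
print(result[3][4:15])
```

For this synthetic pair, the valid interior pixels should recover the applied shift
of two pixels. On real images, the window size and the disparity range would have
to be tuned, as discussed above.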

2.2.4 3D Reconstruction and Data Representation
Given the disparity map, the focal length and the baseline of the camera, 3D recon-
struction can be performed to transform one of the image coordinate systems to the
world coordinate system [12]. Each individual pixel coordinate is transformed from
the image frame to the world frame using Equations (2.3), (2.4) and (2.5).

$$x = \frac{b X_2}{d} \tag{2.3}$$

$$y = \frac{b Y_2}{d} \tag{2.4}$$

$$z = \frac{b f}{d} \tag{2.5}$$

Using Equations (2.3), (2.4) and (2.5), the 3D information from the images is recov-
ered. The world coordinates are relative to the right camera’s optical center but can
similarly be expressed using the left camera’s. The 3D information can be visualized
using a point cloud. Since the points only describe physical distances and positions,
there is no color information. Color can be applied using simple color mappings
based on, for instance, the points’ distances from the camera. Point clouds are
generally computationally expensive to process. Thus, there is an option to use depth
images, which can be encoded with the same information as point clouds. Depth
images can be used in conjunction with convolutional neural networks for extracting
information.
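
As a minimal sketch of Equations (2.3) to (2.5), the function below converts a disparity map into a list of 3D points. It assumes that X2 and Y2 are pixel coordinates measured relative to a principal point (cx, cy), that the focal length is expressed in pixels and that non-positive disparities mark pixels without a match. These conventions and all names are illustrative assumptions, not the implementation used in this work.

```python
def reconstruct_points(disparity, baseline, focal_px, cx, cy):
    """Map each pixel with a valid disparity to (x, y, z) via Eqs. (2.3)-(2.5)."""
    points = []
    for row, line in enumerate(disparity):
        for col, d in enumerate(line):
            if d <= 0:  # assumption: non-positive disparity means no match
                continue
            x2 = col - cx  # image coordinates relative to the principal point
            y2 = row - cy
            points.append((baseline * x2 / d,          # Eq. (2.3)
                           baseline * y2 / d,          # Eq. (2.4)
                           baseline * focal_px / d))   # Eq. (2.5)
    return points


# Toy 2x2 disparity map (pixels); baseline in metres, focal length in pixels.
cloud = reconstruct_points([[0, 10], [20, 0]], baseline=0.05,
                           focal_px=640.0, cx=0.0, cy=0.0)
```

The two valid pixels yield two points, and the pixel with the smaller disparity ends up further from the camera, as expected from Equation (2.5).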

2.3 Convolutional Neural Networks
A convolutional neural network (CNN) is a variant of the artificial neural network
architecture [16]. CNNs are used for image-based machine learning and are the
current state of the art for image analysis. The network
architecture can perform tasks such as object localization and classification.
CNNs are predominantly made up of three components: convolutional, pooling
and fully-connected layers. This structure is demonstrated in Figure 2.2, where
each component occurs once. In practice, the components are commonly stacked and
repeated to build up a more complex architecture.

Figure 2.2: Simple CNN architecture with the three main components:
convolutional, pooling and fully-connected layers. The convolutional layer slides
a filter across the input and the pooling layer performs downsampling. Lastly, the
fully-connected layer evaluates the produced feature map to generate
predictions.

The inputs to the network are images with three dimensions: spatial (width and
height) and depth (channels). Input images commonly have three channels, corre-
sponding to the RGB color format. Dimensions of the data are altered throughout
the network by the component layers [17]. The first component, convolutional lay-
ers, performs convolution on the input with filters [16]. Furthermore, the filters
have trainable parameters which are learned through backpropagation as training
progresses. The general use of the convolutional layer is to extract features from the
images, such as edges, patterns and more complex shapes. Pooling layers perform
downsampling along the spatial dimensions to reduce the complexity for succeeding
layers. The outputs from convolutional and pooling layers are commonly referred
to as feature maps. Stacking multiple convolutional and pooling layers results in
more complex feature maps. Fully-connected layers are commonly used at the end
of the network to evaluate the feature map.
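
The two first components can be illustrated with a small, dependency-free sketch. It assumes a single-channel input, stride 1, no padding and a hand-picked filter; in a real CNN the filter weights are learned and the operations run over many channels. As in most deep-learning frameworks, the filter is not flipped, so the operation is strictly a cross-correlation. All names are hypothetical.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image without padding (stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def max_pool(feature_map, size=2):
    """Downsample by taking the maximum of each non-overlapping block."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    return [[max(feature_map[r * size + i][c * size + j]
                 for i in range(size) for j in range(size))
             for c in range(cols)]
            for r in range(rows)]


# 5x5 image with a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hand-picked horizontal-difference filter; in a CNN these weights are learned.
kernel = [[-1, 1]]
features = convolve2d(image, kernel)  # 5x4 map, non-zero only at the edge
pooled = max_pool(features)           # 2x2 map after downsampling
```

The feature map responds only where the edge is located, and pooling keeps that response while reducing the spatial size.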

A common use for CNNs is object detection in images [18]. Object detection consists
of two steps: localization and classification. Localization involves finding objects
in the image, while classification identifies what each object is. Performing object
detection in real time has historically been a difficult task, but there now exist
networks that manage it.

Training a neural network requires a large amount of data. A common challenge
in machine learning is gathering enough labeled data [19]. Transfer learning
is a powerful method for neural networks which leverages the knowledge of already
trained networks to overcome this challenge. The concept involves adapting a network
that has already been trained on one task to a new application. The network
can then specialize in the new task with a smaller amount of data, compared to
training it from scratch [20]. Network structures such as VGG-16 and ResNet-101
are commonly used within transfer learning [21].

2.4 The R-CNN Framework
A widely used object detection architecture is the region-based convolutional neural
network (R-CNN) family of convolutional networks, presented in [22], [23] and [24].
They perform both localization and classification and, following further developments,
can predict in real time. The variations of the R-CNN are all based
on the same concept. Firstly, the network generates region proposals, or guesses, of
where objects are located within images. Secondly, the proposals are evaluated and
classified. The main difference between the architectures within the R-CNN family
is how region proposals are managed. The three versions: R-CNN, Fast R-CNN and
Faster R-CNN, are summarized in Section 2.4.1.

2.4.1 Versions of R-CNN
The origin and the first version within the R-CNN family of networks is presented in
[22]. It uses the selective search algorithm to generate thousands of region proposals
per image. The proposals are fed through a CNN to extract features. Various
networks, such as VGG-16 [21], can be used as the feature extractor. The feature map is
forwarded to a support-vector machine (SVM), which predicts if there is an object
present in each proposed region. Furthermore, for each proposed region, a refinement
of the bounding box is predicted to improve the precision of the box. There are two
main issues with the method: computational speed and training. R-CNN is not
practical to run in real-time, as passing thousands of individual proposals through
the feature extractor takes a long time as no computations are shared. Furthermore,
the network is not trained end-to-end, since the selective search algorithm is fixed.

The computational speed is improved in the second version of the R-CNN frame-
work, named Fast R-CNN [23]. The main change from the previous iteration is how
the region proposals are handled. Rather than feeding the network with thousands
of proposals separately, the entire image is fed directly to the CNN together with
the set of region proposals from the selective search algorithm. The region proposals
are extracted from the feature map, rather than the input image. Region of interest
(ROI) pooling is performed on the produced feature map to extract regions of fixed
shape. The result is fed into fully-connected layers for evaluation. The convolution
operation is performed only once, compared to processing thousands of proposals in-
dividually. This reduces the number of operations and improves the computational
speed [23]. Even though this decreases the computational time per image signifi-
cantly, selective search is still used and remains the bottleneck for computational
performance.

The third and latest version of the algorithm further improves the efficiency and is
called the Faster R-CNN architecture [24]. The authors solved the bottleneck caused
by the selective search in Fast R-CNN by developing and using a region proposal
network (RPN) instead. By utilizing the RPN, the algorithm can run in real-time,
contrary to its predecessors. Similar to Fast R-CNN, the output of the proposal
method is passed along to ROI-pooling and a classifier.

All three versions are depicted in Figure 2.3. RPN and ROI-pooling are explained
in Sections 2.4.2 and 2.4.3 respectively.

Figure 2.3: Overview of structure and components of the three versions within
the R-CNN family. The main difference is the handling of region proposals. The
R-CNN and the Fast R-CNN use the fixed selective search to generate proposals,
compared to the Faster R-CNN which uses a region proposal network (RPN). Fur-
thermore, depending on the version, a special region of interest (ROI) pooling layer
is required before evaluation. Finally, evaluation is done through support-vector
machines (SVMs) or fully-connected layers.

2.4.2 Region Proposal Network
The region proposal network (RPN) was introduced by Ren et al. in [24]. As
the name suggests, the RPN generates proposals for where objects may be
located in images. It uses a fixed number of bounding boxes, which are the same
for every image, as a basis for its guesses. The RPN predicts the offset required for
each bounding box to fit an object, as well as an objectness score indicating whether the
box contains an object (foreground) or not (background).

The fixed bounding boxes are generated through a grid of points, called anchors.
Multiple boxes are generated at each anchor with different scales and ratios. The
procedure of how anchor boxes are generated based on scales and aspect ratios is
illustrated in Figure 2.4.

Figure 2.4: Box scales and aspect ratios are combined to generate anchor boxes
of various shapes and sizes, as depicted to the right. These boxes are generated for
each anchor (black dot) in the grid.
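
The generation of anchor boxes can be sketched as below. The sketch assumes that the width and height of a box are obtained by multiplying the scale with the two terms of the aspect ratio, which is one common convention but not the only one, and that boxes are stored as (x_min, y_min, x_max, y_max). The anchor positions and all names are hypothetical.

```python
def generate_anchor_boxes(centers, scales, ratios):
    """Return boxes (x_min, y_min, x_max, y_max): one per anchor, scale and ratio."""
    boxes = []
    for cx, cy in centers:
        for scale in scales:
            for rw, rh in ratios:
                # Assumed convention: the scale is multiplied by each ratio term.
                w, h = scale * rw, scale * rh
                boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes


# Two anchors on a hypothetical grid, three scales and three ratios: 18 boxes.
centers = [(16, 16), (32, 16)]
boxes = generate_anchor_boxes(centers, scales=(128, 256, 512),
                              ratios=((1, 1), (1, 2), (2, 1)))
```

With three scales and three ratios, every anchor contributes nine boxes, and the same set of shapes is repeated at each anchor position.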

In the original paper, three scales and three ratios are used to produce a total of
nine boxes for each anchor [24]. Anchor boxes are evaluated against the ground
truth boxes using the intersection over union (IOU) metric. IOU is an important
metric used in both training and evaluation. It is used to evaluate the similarity of
two bounding boxes, and is computed according to the illustration in Figure 2.5.

Figure 2.5: The intersection over union (IOU) is the intersecting area of the boxes
divided by the union area of the boxes.
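
A minimal sketch of the IOU computation for two axis-aligned boxes is given below. The box format (x_min, y_min, x_max, y_max) and the function name are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = min(box_a[2], box_b[2]) - max(box_a[0], box_b[0])
    inter_h = min(box_a[3], box_b[3]) - max(box_a[1], box_b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0  # the boxes do not overlap
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return intersection / (area_a + area_b - intersection)


# Two 2x2 boxes overlapping in a 1x1 square: intersection 1, union 7.
overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))
```

Identical boxes give an IOU of 1, while boxes without overlap give 0.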

In the RPN, binary class labels are assigned to each box for training. The assigned
labels are based on the IOU with the ground truth box. Table 2.1 presents the IOU
boundaries for foreground and background boxes. Anchor boxes that are considered
neither foreground nor background are ignored and do not contribute to training the
network [24].

Table 2.1: Corresponding labels for IOU values.

         IOU > 0.7     0.3 ≤ IOU ≤ 0.7    IOU < 0.3
Label    Foreground    Ignored            Background

In cases where no anchor box fulfills the condition for the foreground label in Table
2.1, the anchor box with the highest IOU is labeled as foreground.
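
The labeling rule in Table 2.1, including the fallback for the case without any foreground anchor, can be sketched as follows. The input is assumed to be a list holding, for each anchor box, its IOU with the ground truth box. The encoding 1, 0 and None for foreground, background and ignored is a hypothetical choice.

```python
def label_anchors(ious, fg_threshold=0.7, bg_threshold=0.3):
    """Assign 1 (foreground), 0 (background) or None (ignored) to each anchor."""
    labels = []
    for value in ious:
        if value > fg_threshold:
            labels.append(1)
        elif value < bg_threshold:
            labels.append(0)
        else:
            labels.append(None)
    # Fallback: if nothing exceeds the foreground threshold, promote the best anchor.
    if ious and 1 not in labels:
        best = max(range(len(ious)), key=lambda k: ious[k])
        labels[best] = 1
    return labels


labels = label_anchors([0.8, 0.5, 0.1])   # one anchor of each kind
fallback = label_anchors([0.6, 0.2])      # no anchor above 0.7
```

In the second call, the anchor with IOU 0.6 is promoted to foreground even though it falls in the ignored interval.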

The training is performed in mini-batches, where one mini-batch arises from a single
image. Each mini-batch contains many foreground and background anchor boxes
that can all be used for training. However, each image is generally dominated by
background samples, and the training would be biased toward these. Hence, a subset
of anchors is chosen randomly to be used for the training. In the original paper,
256 anchors are chosen with equal numbers of foreground and background anchors.
If there are fewer than 128 foreground samples, the mini-batch is padded with more
background samples.
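
The sampling strategy can be sketched as below, assuming anchor labels encoded as 1 (foreground), 0 (background) and None (ignored). The function name, the label encoding and the example counts are hypothetical.

```python
import random


def sample_mini_batch(labels, batch_size=256, seed=None):
    """Pick anchor indices: at most half foreground, the rest background."""
    rng = random.Random(seed)
    foreground = [k for k, label in enumerate(labels) if label == 1]
    background = [k for k, label in enumerate(labels) if label == 0]
    n_fg = min(len(foreground), batch_size // 2)
    # Pad with background anchors when fewer than half the batch is foreground.
    n_bg = min(len(background), batch_size - n_fg)
    return rng.sample(foreground, n_fg), rng.sample(background, n_bg)


# Hypothetical image with 10 foreground, 1000 background and 50 ignored anchors.
labels = [1] * 10 + [0] * 1000 + [None] * 50
fg, bg = sample_mini_batch(labels, seed=0)
```

Here all 10 foreground anchors are used and the remaining 246 positions are filled with randomly chosen background anchors.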

2.4.3 ROI-Pooling
The initial set of proposed boxes is shifted with the offset, predicted by the RPN,
to obtain new refined boxes. These boxes have various sizes, but are required to
have a fixed size for the final classification and regression layers. Therefore, a special
pooling layer, called region of interest (ROI) pooling, is used. The ROI pooling layer
transforms every region of interest defined by the proposal boxes to the same size,
independent of input size, as visualized in Figure 2.6.

Figure 2.6: 2x2 region of interest (ROI) pooling. The input is divided into a 2x2
grid and the maximum of each area, marked with red squares, is the output value.

All regions of interest are divided into a fixed-size grid. Max-pooling, an operation
that returns the maximum value within a specified area, is performed within each
cell of the grid. The output of ROI pooling is therefore always of fixed size. It
can then be fed into fully-connected layers for object classification
and bounding box regression.
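
A minimal sketch of ROI pooling on a single-channel feature map is given below. It assumes regions given as (row0, col0, row1, col1) with exclusive end indices, integer bin edges and regions at least as large as the output grid. These choices and all names are hypothetical.

```python
def roi_max_pool(feature_map, roi, output_size=2):
    """Max-pool the region roi = (row0, col0, row1, col1), end-exclusive,
    into an output_size x output_size grid."""
    row0, col0, row1, col1 = roi
    height, width = row1 - row0, col1 - col0
    pooled = []
    for i in range(output_size):
        # Integer bin edges; bins may differ in size when the region
        # is not evenly divisible by the grid.
        r_start = row0 + (i * height) // output_size
        r_end = row0 + ((i + 1) * height) // output_size
        row = []
        for j in range(output_size):
            c_start = col0 + (j * width) // output_size
            c_end = col0 + ((j + 1) * width) // output_size
            row.append(max(feature_map[r][c]
                           for r in range(r_start, r_end)
                           for c in range(c_start, c_end)))
        pooled.append(row)
    return pooled


feature_map = [[1, 2, 3, 4, 5],
               [6, 7, 8, 9, 10],
               [11, 12, 13, 14, 15],
               [16, 17, 18, 19, 20]]
whole = roi_max_pool(feature_map, (0, 0, 4, 5))   # 4x5 region
inner = roi_max_pool(feature_map, (1, 1, 4, 4))   # 3x3 region
```

Both regions are reduced to a 2x2 output despite their different sizes, which is the property required by the subsequent fully-connected layers.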

2.4.4 Evaluation and Loss Functions
Training performance for the entire Faster R-CNN framework is evaluated through
four loss functions. The RPN contains two: one for the regression layer that performs
bounding box regression, and one for the classifier that determines if a bounding
box is foreground or background. The final layers of the Faster R-CNN have the
remaining two: one for the regression layer that further refines the bounding boxes
from the RPN, and one for the classifier that classifies the objects in the foreground
boxes. Both pairs of loss functions use the same metrics: a cross-entropy loss for
the classification layers and the smooth L1 loss for the regression layers. The loss
functions are presented in Equations (2.6) and (2.7), in accordance with [24].

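$$L_{CE}(p_i^*, p_i) = -\sum_{j=0}^{N_{class}-1} p_i^* \log p_i \tag{2.6}$$

$$L_1(t_i^*, t_i) = \begin{cases} \frac{1}{2}(t_i^* - t_i)^2, & \text{if } |t_i^* - t_i| \le 1 \\ |t_i^* - t_i| - 0.5, & \text{otherwise} \end{cases} \tag{2.7}$$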

In Equations (2.6) and (2.7), $i$ is the index of a box in the current mini-batch.
Furthermore, $p_i$ is the set of predicted class probabilities for box $i$, and the sum
over $j$ in Equation (2.6) runs over its $N_{class}$ entries, where $N_{class}$ is the
number of classes. The predicted shift is contained in $t_i$. It describes the offset of
the center, width and height of the box required to align it with a ground truth box. Ground
truth values are denoted with an asterisk.
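
The two loss functions can be sketched as follows. The sketch assumes a one-hot encoded ground truth for the cross-entropy and evaluates the smooth L1 loss for a single offset component. The small constant eps, which guards against taking the logarithm of zero, is an addition for numerical safety and not part of Equation (2.6).

```python
import math


def cross_entropy(p_true, p_pred, eps=1e-12):
    """Equation (2.6): p_true is a one-hot label, p_pred the class probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(p_true, p_pred))


def smooth_l1(t_true, t_pred):
    """Equation (2.7) applied to a single offset component."""
    diff = abs(t_true - t_pred)
    return 0.5 * diff ** 2 if diff <= 1 else diff - 0.5


# The true class is the second of three; the network assigns it probability 0.5.
ce = cross_entropy([0, 1, 0], [0.2, 0.5, 0.3])
small_error = smooth_l1(0.0, 0.5)   # quadratic branch
large_error = smooth_l1(0.0, 3.0)   # linear branch
```

The smooth L1 loss is quadratic for small errors and linear for large ones, which limits the influence of outliers on the regression.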

For the binary case, the cross-entropy in Equation (2.6) can be written as Equation
(2.8).

$$L_{BCE}(p_i^*, p_i) = -\left(p_i^* \log p_i + (1 - p_i^*) \log(1 - p_i)\right) \tag{2.8}$$

For the RPN, the binary cross-entropy function is used and the labels are foreground
(1) and background (0). Ground truth for the regression is the offset for each anchor
box to a true box. For the final layers, the categorical cross-entropy function is used
and the labels are multiple classes. Offsets to the same true boxes are used for the
regression. However, the offset values are not the same, since the anchor boxes have
been shifted by the RPN prediction. The loss function for a mini-batch is defined
as the sum of losses from each anchor, according to Equation (2.9).

$$L(\{p_i\}, \{t_i\}) = \frac{1}{N_{CE}} \sum_i L_{CE}(p_i^*, p_i) + \frac{1}{N_{reg}} \sum_i p_i^* L_1(t_i^*, t_i) \tag{2.9}$$

The regression loss is only activated for foreground anchors, since $p_i^* = 0$ for
background anchors. Finally, the loss terms are normalized using $N_{CE}$ and $N_{reg}$,
corresponding to the mini-batch size and the number of anchor locations respectively.
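
A sketch of the combined RPN loss in Equation (2.9) is given below, using the binary cross-entropy of Equation (2.8). It assumes that the smooth L1 loss is summed over the four offset components of each box, and that the labels are 1 for foreground and 0 for background. The clamping constant eps and all names are hypothetical additions.

```python
import math


def binary_cross_entropy(p_true, p_pred, eps=1e-12):
    """Equation (2.8), with clamping to avoid log(0)."""
    p = min(max(p_pred, eps), 1 - eps)
    return -(p_true * math.log(p) + (1 - p_true) * math.log(1 - p))


def smooth_l1(t_true, t_pred):
    """Equation (2.7) for one offset component."""
    diff = abs(t_true - t_pred)
    return 0.5 * diff ** 2 if diff <= 1 else diff - 0.5


def rpn_loss(labels, probs, true_offsets, pred_offsets, n_ce, n_reg):
    """Equation (2.9): normalized classification loss plus regression loss,
    where the regression term counts only for foreground anchors (label 1)."""
    cls = sum(binary_cross_entropy(y, p) for y, p in zip(labels, probs))
    reg = sum(y * sum(smooth_l1(a, b) for a, b in zip(t_true, t_pred))
              for y, t_true, t_pred in zip(labels, true_offsets, pred_offsets))
    return cls / n_ce + reg / n_reg


# One foreground and one background anchor with made-up predictions.
loss = rpn_loss(labels=[1, 0], probs=[0.5, 0.5],
                true_offsets=[(0, 0, 0, 0), (0, 0, 0, 0)],
                pred_offsets=[(0.5, 0, 0, 0), (3, 3, 3, 3)],
                n_ce=2, n_reg=1)
```

The large offset error of the background anchor does not contribute to the loss, since its label is zero.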

2.4.5 Summary of Faster R-CNN
To summarize, the full Faster R-CNN structure is illustrated in Figure 2.7.

Figure 2.7: All components and connections to build the Faster R-CNN structure.

The input is fed to a feature extractor, often a pre-trained network. The resulting
feature map is forwarded to an RPN to identify regions that may contain an object
in the image. ROI-pooling is performed on the feature map together with the region
proposals. Finally, the proposals are evaluated.

3 Methods

This chapter presents the methods and workflow to investigate optical load detection.
The method includes hardware choice and setup, data acquisition, data processing
with neural networks and evaluation.

3.1 Hardware
A vital part of the load detection solution was the choice and mounting of hardware.
Data acquisition, data processing and evaluation depend directly on the choices regarding
hardware. This section motivates the chosen hardware with respect to its
characteristics and the desired data. Furthermore, the hardware integration, including
settings and development of the data acquisition, is presented.

3.1.1 Choice of Equipment
The idea was to use depth data for fill-factor estimation and RGB data for the
identification of the material type. Depth data was of interest since it contains
3D information appropriate for determining bucket fill-factor, which might not be
available in regular 2D RGB images. Depth information can be obtained through
various non-contact techniques involving 3D imaging and range measurements, some
of which are described in Section 2.1.

The choice of sensor for the developed system was a stereo camera. One large
benefit is that it captures both RGB and depth data simultaneously. A stereo-
vision system captures images from two perspectives to retrieve 3D information
using triangulation. Capturing images using a camera is simple and fast. However,
the accuracy and the processing time of the depth calculation depend on the
algorithms used and the image content. For example, high accuracy can be difficult to
achieve when capturing surfaces with ambiguous textures.

The sensor used was an Intel RealSense Depth Camera D435i with integrated Intel
RealSense Depth Module D430, a full-HD (1920 × 1080) RGB camera, an infrared
projector and an inertial measurement unit (IMU) [25]. The depth module uses
stereo vision for calculating depth. The infrared projector projects a static infrared
pattern to improve depth accuracy in scenes with ambiguous textures. The baseline
of the D435i sensor is 50 mm, the focal length is 1.93 mm and the optimal operating
range is specified as 0.3 m to 3 m [26].

3.1.2 Camera Settings and Data Acquisition Pipeline
Intel provides a software development kit (SDK) for convenient operation of their depth
cameras [27]. The SDK includes an application programming interface (API),
which was used for streaming and acquisition of data. Initializing camera settings,
starting and displaying the camera stream and capturing data can all be done
through the API. The acquisition pipeline stored an RGB image and the depth
data. The RGB image and the depth data had to be aligned since they were captured
from slightly different perspectives due to the placement of the sensors. RGB
images and depth data were captured at resolutions of 1920×1080 and 1280×720
respectively, at 30 frames per second (FPS). In the alignment step, the depth data
was upsampled to match the resolution of the RGB image.

The Intel RealSense Depth Camera D435i has an integrated processor with built-in
algorithms for processing the data. The processor provides fast depth calculations
but the specific algorithms used are not publicly available. However, there are
numerous settings available for adapting the performance to the specific application.
Several presets are available, of which High Density was used [28]. It is the
recommended preset for object recognition and enhanced 3D photography.

3.2 Depth Image Generation
Depth data was used for generating depth images. Distance values outside of a
specified interval were filtered out, since these correspond to measurements outside the
range of the container. The Jet color map [29] was then applied. The color
map is depicted in Figure 3.1. It shows that points close to the sensor tend
towards blue while points further away tend towards red. Filtered points are mapped
to black.

Figure 3.1: The color map Jet, used for colorizing depth data [29]. Low distance
values are mapped to blue, while higher values tend towards red.

The distance interval had to be the same for all data captured with a sensor and
bucket setup. This was to ensure the data was processed consistently and was
comparable. All distance values within the interval were normalized and the color
map was applied. A tighter distance interval provides larger contrast in terms of
depth differences in the images, emphasizing the structure of the measured material.
Figure 3.2 depicts examples of generated depth images using two distance intervals
for comparison.
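
The filtering, normalization and colorization steps can be sketched as below. The color function is a common piecewise-linear approximation of Jet and not the exact color map of [29], and the depth values are made up. The interval is the one listed for the static test in Table 3.1. All names are hypothetical.

```python
def jet_like(value):
    """Piecewise-linear approximation of Jet for value in [0, 1] (blue to red)."""
    def clamp(v):
        return max(0.0, min(1.0, v))
    red = clamp(1.5 - abs(4.0 * value - 3.0))
    green = clamp(1.5 - abs(4.0 * value - 2.0))
    blue = clamp(1.5 - abs(4.0 * value - 1.0))
    return tuple(round(255 * channel) for channel in (red, green, blue))


def depth_to_image(depth, d_min, d_max):
    """Colorize a depth map (metres); values outside [d_min, d_max] become black."""
    image = []
    for line in depth:
        row = []
        for distance in line:
            if d_min <= distance <= d_max:
                # Normalize to [0, 1] within the interval before colorizing.
                row.append(jet_like((distance - d_min) / (d_max - d_min)))
            else:
                row.append((0, 0, 0))  # filtered point
        image.append(row)
    return image


# Made-up depth values: one outside the interval, three inside.
image = depth_to_image([[0.5, 0.9], [1.175, 1.45]], d_min=0.9, d_max=1.45)
```

Since the normalization stretches the chosen interval over the full color range, a tighter interval assigns a larger color difference to the same difference in depth.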

(a) RGB image (b) Large interval (c) Tight interval

Figure 3.2: Examples of generated depth images and a comparison of filtering
intervals. The left image is the regular RGB image displaying the scene, the
middle image is a depth image produced with a large distance interval and the right
image is a depth image generated with an adapted, tighter distance interval.

The examples in Figure 3.2 depict cobblestone with large variations in terms of
structure. Comparing the depth images in Figures 3.2b and 3.2c, the tighter interval
provides higher contrast and more detail.

3.3 System Architecture
This section describes the system architecture used for estimating fill-factor, volume,
type, weight and where in the images the loaded material is located. An overview
of the system is depicted in Figure 3.3.

Figure 3.3: An overview of the load detection system describing the flow and
processing of the data. By utilizing a depth image and an RGB image, the weight of the
loaded material can be calculated. The system uses CNNs to determine both the fill-factor
and the material type. Combining this information with a predetermined reference
volume and density results in a weight prediction.

The system in Figure 3.3 received two images as input: a depth image and an RGB
image. Depth images were used for material localization and fill-factor estimation.
Passing the depth image through a Faster R-CNN resulted in a bounding box enclosing
the bucket and material, as well as a predicted fill-factor. The bounding box was
used to mark the area in the RGB image in which the material classifier predicted the
material type. Knowing the material type, material properties could be retrieved.
The weight was calculated by combining the density of the material, the predicted
fill-factor and the reference volume based on the bucket capacity. This is further
illustrated in Figure 3.4.

Figure 3.4: Weight estimation using fill-factor, reference volume, and material
prediction. The reference volume is external information specified on the bucket,
the fill-factor is estimated through the Faster R-CNN and the material density is
extracted from a database based on the prediction from the material classifier.

Sections 3.3.1 and 3.3.2 describe the steps in the system architecture in detail,
specifically the fill-factor estimation and the material classification.

3.3.1 Fill-Factor Estimation
A Faster R-CNN architecture was used to localize the material and classify the
fill-factor of the container. A pre-trained VGG-16 network was used as the feature
extractor. The output of the network was a single bounding box and probabilities
corresponding to intervals of fill-factors. The fill-factor represents the degree to which
the container is filled. It can be calculated from either weight or volume, and the two are
equivalent provided that the density is constant, regardless of the volume. For training
the network, a weight to fill-factor association had to be established for each sensor and
bucket setup, since the reference data were weights. The density of the material, $\rho$,
was used together with the bucket volume capacity, $V_{ref}$, for calculating a reference
weight $w_{ref}$ in Equation (3.1).

$$w_{ref} = \rho V_{ref} \tag{3.1}$$

The reference weight obtained through Equation (3.1) was used to relate measured
weights to fill-factors. Since the desired output $x_{est}$ was a continuous value, a
weighted sum was calculated using the probabilities $p_i$ and the mid-points $z_i$ of
the intervals across all $N_{class}$ classes, as presented in Equation (3.2).

$$x_{est} = \sum_{i=1}^{N_{class}} p_i z_i \tag{3.2}$$
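
The weighted sum in Equation (3.2) and the subsequent weight calculation can be sketched as follows. The sketch assumes that the weight is the product of the fill-factor, the density and the reference volume, that is, the fill-factor times the reference weight of Equation (3.1). The class intervals, probabilities, density and volume are made-up values, and the function names are hypothetical.

```python
def expected_fill_factor(probabilities, midpoints):
    """Equation (3.2): probability-weighted sum of the interval mid-points."""
    return sum(p * z for p, z in zip(probabilities, midpoints))


def estimate_weight(fill_factor, density, reference_volume):
    """Scale the reference weight of Equation (3.1) by the fill-factor."""
    return fill_factor * density * reference_volume


# Hypothetical example: four equally wide fill-factor classes on [0, 1].
midpoints = [0.125, 0.375, 0.625, 0.875]
probabilities = [0.0, 0.25, 0.5, 0.25]   # made-up network output
x_est = expected_fill_factor(probabilities, midpoints)
# Made-up density (kg/m^3) and bucket volume (m^3).
weight = estimate_weight(x_est, density=1600.0, reference_volume=1.0)
```

Using the weighted sum rather than the most probable class lets the estimate fall between the class mid-points when the probability mass is spread over neighboring intervals.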

3.3.2 Material Classification

The purpose of the material classifier was to detect the type of the handled material
and retrieve the density. Similar to the fill-factor estimation, the material classifier
utilized transfer learning with a pre-trained VGG-16 network as a feature extractor.
The RGB image captured by the stereo camera was cropped using the predicted
bounding box from the fill-factor network. The cropped image was then passed
through the feature extractor. Lastly, the resulting feature map was fed through a
set of fully-connected layers and a classification layer to predict the material type.

The network predicts probabilities over the material categories. When training the
network, the categories of interest had to be decided. For certain applications it
may be sufficient to detect the class of the material, such as gravel. In others, it can
be necessary to identify further details, such as which fraction of the gravel it is.
Therefore, predictions were evaluated both for groups of materials and for individual
classes.

3.4 Data Acquisition

Convolutional neural networks were utilized in the developed system. Hence, a large
quantity of labeled data was required for training. The following sections describe
how various data acquisition campaigns were conducted.

3.4.1 Initial Investigation

An initial investigation was conducted to evaluate the feasibility of the proposed
setup under ideal conditions. To obtain maximal depth information and avoid occlusions,
the camera was mounted directly above a container holding some material.
Figure 3.5 shows the setup together with an example RGB image and a depth image
generated from the depth data captured by the stereo camera.

(a) Setup overview (b) RGB image (c) Depth image

Figure 3.5: Initial static camera setup for feasibility evaluation. Left image depicts
a side view of the setup, the middle image the field of view captured from the RGB
camera and the image to the right the produced depth image.

Various quantities of gravel were used for the measurements. To ensure that the
fill-factor of the container was the only variation between measurements, the camera
and container positions were not altered. For each measurement, the material and
container were weighed using a scale. These weights were used as ground truth
references. The largest weight within the dataset was used as the reference weight for
converting the ground truth weights to fill-factors.

3.4.2 Excavator
The second acquisition campaign was conducted on an excavator. A mounting
platform with magnets was constructed for convenient mounting of the camera on
the machine. The platform was mounted on the stick, as shown in Figure 3.6.

Figure 3.6: Intel RealSense Depth Camera D435i mounted on the excavator, used
for collecting data.

A similar type of gravel was used as in the initial investigation. During the data
acquisition, the pose of the bucket was approximately the same for each measure-
ment, which limited the number of influential factors in the resulting data. Figure
3.7 depicts examples of RGB and produced depth images using the sensor setup.

(a) RGB image (b) Depth image

Figure 3.7: Sample images captured by the stereo camera mounted on the exca-
vator. The left is the RGB image and the right the generated depth image.

As ground truth reference, an on-board weighing system in the excavator was used.
The density of the gravel was retrieved by weighing a known volume of it. A reference
weight was then obtained through Equation (3.1). The maximum weight within the
dataset, the number of measurements and the calculated reference weight are found
in Table 3.2.

3.4.3 Wheel Loader
The third campaign was conducted on a wheel loader. An extension was constructed
and attached to the cabin roof for mounting the stereo camera. The extension
reached forward and upward to grant the sensor a field-of-view of the bucket, as
well as protect the sensor from the boom and the bellcrank of the wheel loader. The
setup is depicted in Figure 3.8.

Figure 3.8: Intel RealSense Depth Camera D435i mounted on the wheel loader,
used for collecting data.

The same pile of gravel was used as in the excavator campaign. Figure 3.9 depicts
one of the conducted measurements.

(a) RGB image (b) Depth image

Figure 3.9: Sample images captured by the stereo camera mounted on the wheel
loader. The left is the RGB image and the right the generated depth image.

Similar to the excavator, an on-board weighing system was used to retrieve ground
truth references. A reference weight was calculated using Equation (3.1). The
maximum weight within the dataset, the number of measurements and the calculated
reference weight are found in Table 3.2.

3.4.4 Material Collection

For material classification, a large quantity of regular RGB images of various material
types was required. Materials were photographed at a material dealer that offered
stacked piles of various kinds, such as macadam, gravel and sand. Fifty images of each
material type were captured from assorted perspectives using the Intel RealSense
Depth Camera. Furthermore, materials were weighed to build a dataset of densities.
Figure 3.10 depicts some of the piles used during the material data acquisition.

Figure 3.10: Examples of piles used for acquiring RGB images used for training
the material classification network.

3.5 Training

The Faster R-CNN was trained on depth images, which were produced through the
steps described in Section 3.2. The distance intervals used for the established datasets
were fine-tuned manually and are presented in Table 3.1.

Table 3.1: Distance intervals used for filtering depth data.

                    Static test    Excavator    Wheel loader
Min distance (m)    0.9            1.4          2.0
Max distance (m)    1.45           3.1          4.2

Within each image, the container with the material was annotated with a bounding
box, used for training the Faster R-CNN. Data augmentation was applied to expand
the training dataset by slightly altering the images with combinations of rotations
and filters. The depth images were resized to reduce the computational cost for the
network, without losing important details in the images. The data
was then divided into a training set and a test set for evaluation. Table 3.2 summarizes
the sizes of the datasets and the hyperparameters used for training the fill-factor
estimator (Faster R-CNN).

Table 3.2: Sizes of the datasets and hyperparameters used for training the Faster
R-CNN.

                                        Static test          Excavator                     Wheel loader
Training images                         180                  386                           280
Training images (with augmentations)    1 620                3 474                         2 133
Test images                             20                   69                            43
Number of categories                    10                   12                            15
Image size (resized)                    300, 533             300, 533                      300, 533
Anchor box scales                       150, 180, 200        190, 200, 210                 190, 200, 210
Anchor box ratios                       1:1, 1:1.2, 1.2:1    1:1, 1/√2:2/√2, 2/√2:1/√2     1:1, 1/√2:2/√2, 2/√2:1/√2
Ref. weight [kg]                        65                   1 394                         4 420
Max weight [kg]                         65                   1 670                         6 520

Note that, since the reference weight was less than the maximum weight for the excavator
and the wheel loader campaigns, fill-factors above 1.0 occurred. Furthermore, predicted
weights below 200 kg for the excavator and the wheel loader were considered
to be empty buckets, since no measurements were conducted below this weight. For
the material classification, images were divided into datasets according to Table 3.3.

Table 3.3: Dataset sizes and number of classes used in the material classification
network.

Parameter                               Quantity
Training images                         595
Training images (with augmentations)    4 760
Test images                             171
Number of materials                     17

3.6 Evaluation
Various metrics were used for evaluating the fill-factor predictions. The approximation
error, or sample error, is a rudimentary metric based on the difference between the ground
truth and the predicted value. The mean of the approximation errors is calculated
as the mean absolute error (MAE) and the mean absolute percentage error (MAPE),
according to Equations (3.3) and (3.4) respectively.

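$$\mathrm{MAE} = \frac{1}{N} \sum_{i=0}^{N-1} \left| x_i^* - x_i \right| \tag{3.3}$$

$$\mathrm{MAPE} = \frac{1}{N} \sum_{i=0}^{N-1} \left| \frac{x_i^* - x_i}{x_i^*} \right| \tag{3.4}$$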

In Equations (3.3) and (3.4), $x^*$ and $x$ are the ground truth and predicted values
respectively, and $N$ is the total number of samples. An option is to calculate the
MAPE relative to the maximum value within the dataset, $x^*_{max}$, so that large
relative errors caused by predictions of low values are not punished. The metric is
denoted $\mathrm{MAPE}_{max}$ and is presented in Equation (3.5).

$$\mathrm{MAPE}_{max} = \frac{1}{N} \sum_{i=0}^{N-1} \left| \frac{x_i^* - x_i}{x_{max}^*} \right| \tag{3.5}$$

The relative error is only defined for non-zero ground truth values and can therefore
not be computed when empty buckets are included in the evaluation. To incorporate all
data, additional metrics were used, such as the standard deviation (STD or σ) of the
sample errors, which is calculated using the mean error (µ) according to Equations (3.6) and (3.7).

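$$\mu = \frac{1}{N} \sum_{i=0}^{N-1} \left( x_i^* - x_i \right) \tag{3.6}$$

$$\sigma = \sqrt{\frac{1}{N} \sum_{i=0}^{N-1} \left( (x_i^* - x_i) - \mu \right)^2} \tag{3.7}$$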

The distribution of the errors was described with the mean and the STD. For instance,
approximately 95% of the data lies within two standard deviations (2σ) of the mean,
assuming the data is normally distributed. This was used for representing the interval
within which most of the errors were contained.
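
The metrics in Equations (3.3) to (3.7) can be sketched as below. Two assumptions are made: samples with a ground truth of zero are skipped when computing the MAPE, and the STD is computed on the sample errors, which is taken to be the intent of Equation (3.7). The function name and the example values are hypothetical.

```python
import math


def evaluate(truth, predicted):
    """Return MAE, MAPE, MAPE_max, mean error and STD of the errors."""
    errors = [t - p for t, p in zip(truth, predicted)]
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n
    # MAPE is undefined for zero ground truth, so those samples are skipped.
    relative = [abs(e / t) for e, t in zip(errors, truth) if t != 0]
    mape = sum(relative) / len(relative) if relative else float("nan")
    mape_max = sum(abs(e) for e in errors) / (n * max(truth))
    mean = sum(errors) / n
    std = math.sqrt(sum((e - mean) ** 2 for e in errors) / n)
    return {"MAE": mae, "MAPE": mape, "MAPE_max": mape_max,
            "mean": mean, "STD": std}


# Made-up weights in kg: ground truth and predictions for three samples.
metrics = evaluate([100, 200, 400], [90, 220, 400])
```

In this example the MAPE is larger than the MAPE relative to the maximum value, since the errors occur on the two lighter samples.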

The overall performance of the system was directly influenced by how well the regions
of interest (bounding boxes) were found. Therefore, the mean IOU was calculated over
all bounding boxes as an indication of how well the predicted bounding boxes matched
the ground truth. The calculation of IOU is illustrated in Figure 2.5. The bounding
boxes were also visually inspected to further evaluate whether the predictions of bucket
and material location were valid.

The material classification was evaluated using the accuracy of class predictions.
The accuracies for all predictions can be summarized in a confusion matrix, which
compares true and predicted classes in a grid-like fashion. Each entry is calculated as
the relative frequency within each true class. Ideally, the diagonal entries should be one,
while off-diagonal entries should be zero. In such a case, the predicted class
coincides perfectly with the true class. Off-diagonal values are the
incorrect predictions, which provide insight into which classes the network confuses.
A schematic of the confusion matrix with some example predictions is found in Figure
3.11. In the example, material 2 was misclassified as material 1 and as material
3 20% of the time each, while material 3 was misclassified as material 2 10%
of the time.

Figure 3.11: An example confusion matrix. Material 1 was predicted correctly,
while material 2 and 3 had some misclassifications.
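The example in Figure 3.11 can be reproduced with a small sketch. It assumes that rows correspond to true classes and columns to predicted classes, and that each row is normalized by the number of samples of that true class; the ten samples per class are made up so that the resulting frequencies match the example.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Row-normalized confusion matrix: rows are true classes, columns predictions."""
    index = {c: i for i, c in enumerate(classes)}
    counts = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        counts[index[t]][index[p]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # Relative frequency within each true class; empty classes give zeros.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


def accuracy(true_labels, predicted_labels):
    """Fraction of samples whose predicted class equals the true class."""
    hits = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return hits / len(true_labels)


# Ten hypothetical samples per material, arranged to mimic the example figure.
true_labels = [1] * 10 + [2] * 10 + [3] * 10
predicted_labels = [1] * 10 + [1] * 2 + [2] * 6 + [3] * 2 + [2] * 1 + [3] * 9
cm = confusion_matrix(true_labels, predicted_labels, classes=[1, 2, 3])
for row in cm:
    print(row)
```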

4 Results

This chapter presents results from fill-factor and weight estimations from the trained
network, using the data collected during the acquisition campaigns. Furthermore, the
results from material classifications are presented.

4.1 Fill-Factor and Weight Estimations

The results for the fill-factor and weight estimations are presented individually for
each conducted acquisition campaign. Details about the datasets and parameters
used for training the network are available in Table 3.2. Figures 4.1-4.3 present two
images from each campaign with their corresponding bounding box and fill-factor
predictions.

(a) One of the largest deviations. (b) One of the best predictions.

Figure 4.1: Two samples from the static campaign with ground truth (GT) and
predictions (est.). Ground truth bounding boxes are drawn in white and predicted
bounding boxes in yellow. The estimated fill-factors and the resulting weight esti-
mations are presented together with the sample error (diff.).

From Figure 4.1, it is observed that the depth images are similar in appearance.
The differences lie in the depth variations within the containers, where a higher
fill-factor results in a bluer hue.

(a) One of the largest deviations. (b) One of the best predictions.

Figure 4.2: Two samples from the excavator campaign with ground truth (GT)
and predictions (est.). Ground truth bounding boxes are drawn in white and pre-
dicted bounding boxes in yellow. The estimated fill-factors and the resulting weight
estimations are presented together with the sample error (diff.).

In Figure 4.2, the bucket is distinguished from the background through the depth
filtering. In Figure 4.2b, some of the background is included due to the bucket
being close to the material pile or ground. The orientations of the buckets are
noticeable through their profiles in the images. A curled bucket results in a wider
and non-uniform profile while a bucket parallel to the sensor is more rectangular.

(a) One of the largest deviations. (b) One of the best predictions.

Figure 4.3: Two samples from the wheel loader campaign with ground truth (GT)
and predictions (est.). Ground truth bounding boxes are drawn in white and pre-
dicted bounding boxes in yellow. The estimated fill-factors and the resulting weight
estimations are presented together with the sample error (diff.).

In Figure 4.3, outlier rejection appears to be more difficult since much of the
background is included in the depth images. Furthermore, the bucket is not centered
in the sample images. Unlike in the other setups, the bucket moves laterally.
However, the material is still localized properly.

It is observed for each sample in Figures 4.1-4.3 that the predicted bounding boxes
visually overlap to a large extent with the true reference boxes. The predicted fill-
factors are used to calculate the estimated weights according to the methodology
established in Section 3.3. Evaluating the test images using the metrics described
in Section 3.6 yields the results presented in Table 4.1.

Table 4.1: The performance of the fill-factor and weight estimations, evaluated
through presented metrics.

Metric                  Static   Excavator   Wheel loader
Fill-factor MAE         0.03     0.04        0.05
Fill-factor STD         0.03     0.05        0.06
Weight MAE [kg]         1.8      55.3        199.2
Weight STD [kg]         2.1      70.6        258.8
MAPE [%]                5.1      8.9         6.0
MAPEmax [%]             2.5      3.3         3.0
Mean IOU                0.86     0.82        0.83

In Table 4.1, the MAE and the STD are presented individually for fill-factor and
weight, while relative metrics and mean IOU are independent of data type. It is
observed that the static setup yields the least deviating predictions considering all
metrics. The MAE and STD in terms of fill-factor are similar for the excavator and
the wheel loader. However, since the wheel loader bucket holds more material due
to its larger volume capacity, the errors in absolute weight are larger.
Nonetheless, both MAPE and MAPEmax are lower for the wheel loader than for
the excavator. The mean IOU further confirms the observation that the bounding
boxes overlap to a large extent with the reference boxes. Supplementary results in
terms of sample and relative errors, with regard to the loaded weight, are presented
in Figures 4.4, 4.5 and 4.6.

(a) Sample error x∗ − x with mean and
confidence interval.

(b) Absolute relative error fit to a de-
caying exponential.

Figure 4.4: Sample and absolute relative errors with respect to loaded (true)
weights for the static campaign.

(a) Sample error x∗ − x with mean and
confidence interval.

(b) Absolute relative error fit to a de-
caying exponential.

Figure 4.5: Sample and absolute relative errors with respect to loaded (true)
weights for the excavator campaign.

(a) Sample error x∗ − x with mean and
confidence interval.

(b) Absolute relative error fit to a de-
caying exponential.

Figure 4.6: Sample and absolute relative errors with respect to loaded (true)
weights for the wheel loader campaign.

In Figures 4.4a, 4.5a and 4.6a, the sample errors, x∗ − x, are depicted together with
the mean and the two-standard-deviation confidence intervals. A negative sample
error is an overestimate of the weight whereas a positive error is an underestimate.
The slope of the drawn mean error line indicates how the sample errors correlate with
the loaded weights. It is noticed that the system tends to overestimate low weights
and underestimate higher weights. Moreover, in Figures 4.4b, 4.5b and 4.6b, a
decaying exponential is fit to the relative errors. The exponential shows the trend
of the relative errors, which is observed to decrease as the loaded weight increases.
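The two trend curves can be illustrated as follows. The fitting procedure behind the figures is not stated in this chapter, so the sketch assumes an ordinary least-squares line for the sample errors and a decaying exponential of the form a·exp(−b·w) obtained from a least-squares line through the logarithm of the relative errors. All data values and function names are made up for the example.

```python
import math


def linear_fit(xs, ys):
    """Ordinary least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def exponential_fit(xs, ys):
    """Fit y = a * exp(-b * x) via a linear fit to log(y); requires y > 0."""
    slope, intercept = linear_fit(xs, [math.log(y) for y in ys])
    return math.exp(intercept), -slope


# Hypothetical loaded weights [kg] and sample errors x* - x [kg].
weights = [100.0, 200.0, 300.0, 400.0]
sample_errors = [-20.0, -5.0, 10.0, 25.0]
slope, intercept = linear_fit(weights, sample_errors)

# Hypothetical relative errors that decay with the loaded weight.
relative_errors = [0.2 * math.exp(-0.005 * w) for w in weights]
a, b = exponential_fit(weights, relative_errors)
print(slope, intercept, a, b)
```

A positive slope of the mean error line corresponds to the behaviour described above: negative errors (overestimates) at low weights and positive errors (underestimates) at high weights.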

The sum of the sample errors is also of interest, since it indicates how the system
performs over multiple measurements. Table 4.2 presents the sum of the sample errors,
the sum of the loaded weights and the resulting relative error for each campaign. The
number of samples for each campaign is presented in Table 3.2.

Table 4.2: Sum of sample errors for each campaign together with the sum of the
loaded weights and relative error.

                           Static test   Excavator   Wheel loader
Sum of errors [kg]         −13           −260        −3 500
Total loaded weight [kg]   790           52 700      176 000
Relative error [%]         −1.6          −0.5        −2.0

From Table 4.2, it is observed that, summed over each campaign, the system
overestimates the weight. For the most part, individual errors tend to even out,
resulting in a low relative error.
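The relative errors in Table 4.2 follow from dividing the sum of errors by the total loaded weight. The short sketch below repeats that arithmetic with the rounded values printed in the table; the dictionary layout is only chosen for the example.

```python
# (sum of errors [kg], total loaded weight [kg]) as printed in Table 4.2.
campaigns = {
    "Static test": (-13.0, 790.0),
    "Excavator": (-260.0, 52_700.0),
    "Wheel loader": (-3_500.0, 176_000.0),
}

# Relative error in percent of the total loaded weight.
relative = {
    name: 100.0 * error_sum / total
    for name, (error_sum, total) in campaigns.items()
}

for name, value in relative.items():
    print(f"{name}: {value:.1f} %")
```

Rounded to one decimal, the values agree with the last row of the table.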

4.2 Material Classification
The second part of the load detection system is the material classification. The full
list of materials used for training and evaluating the developed system is available in
Appendix A, together with material sample images and densities. For convenience
and visualization purposes, the materials are grouped into seven main categories
based on material properties. Predictions are made for the individual material
types and then grouped by category; the categorized predictions are presented in the
confusion matrix in Figure 4.7.

Figure 4.7: Confusion matrix with seven classes. Correct predictions lie on the
diagonal.

The main observation in Figure 4.7 is that the large values are on the diagonal.
Thus, the network is accurate in its categorized material predictions. There appears
to be some confusion between macadam and gravel, as well as between gravel and sand.

5 Discussion

This chapter contains a discussion around the results from each campaign and the
complete solution. The choice of hardware and placement is discussed. Finally, the
developed system is considered from a larger perspective and future work is suggested.

5.1 Fill-Factor Estimations

The results from the initial test provide insight into the feasibility of the presented
methodology. The static setup is believed to be optimal, since no occlusions from
the container occur. Furthermore, fill-factor variations only affect the height, which
is directly reflected in the distance found by the sensor. Hence, variations are fully
observable from the sensor position. Performance of the initial static test is consid-
ered decent, since MAPE and MAPEmax are 5.1 % and 2.5 % respectively. This
indicates that an optical load detection system is possible with the proposed setup.

Localization of the container in the static setup is expected to be good, since
the container is not moved between measurements, contrary to the excavator
or wheel loader bucket. Bounding boxes are located accurately, both visually
and numerically, as shown in the samples in Figure 4.1 and in Table 4.1. Similar
behaviour can be seen in both the excavator and wheel loader samples in Figures 4.2
and 4.3. The bounding boxes for the excavator and wheel loader are visually offset
more from the ground truth boxes, compared to the initial static test. However, the
overlap is high, as shown by the mean IOU in Table 4.1.

It is evident from Table 4.1 that the static test has the best performance across
all metrics and campaigns. The relative errors for the excavator and wheel loader
campaigns are slightly worse. Fill-factor errors translate into larger errors in
absolute weight for the machines, since the quantity of material is considerably higher.
Estimations are expected to be more difficult on the machines, due to the higher
complexity of the setups.

From the error plots in Figures 4.4, 4.5 and 4.6, it is clear that the network's
predictions are biased. For low weights, it tends to overestimate and for higher
weights to underestimate. The exception is the static test, where all predictions are
overestimations. The excavator and wheel loader are usually loaded to a large fill-factor to
utilize their load capacity, and the most accurate predictions are provided in this
interval. Moreover, the relative error decreases as the loaded weight increases. Thus,
operations in the upper range of fill-factor (and weight) are optimal for the system.
It is observed that the data distribution plays a large role in the general appearance
of the sample error plots. Ideally, data should be evenly spread across the entire
interval to obtain a balanced dataset and avoid bias during training. If this were the
case, the slope of the mean error line would be expected to become flatter.

Table 4.2 presents the sum of the sample errors for each campaign. These sums are
of interest since high accuracy might not be required for individual measurements,
but rather for the total over several measurements. When loading a truck or similar,
the final weight is usually most important. The table shows that the system
overestimates the weights, but not significantly. The relative errors are deemed
excellent, especially for the excavator.

5.2 Material Classification
The developed CNN architecture appears to yield sufficient feature maps to identify
the investigated set of materials. The test set is composed of images captured
from the same material piles as the training images, but from altered perspectives.
Hence, similar images appear in both training and evaluation, which may affect the
credibility of the result. Evaluation using the test set reveals high classification
accuracy, with few incorrect predictions, as can be seen in the confusion matrix in
Figure 4.7. It is observed that the confused material classes tend to be of the same or
similar types. The main difficulties in prediction appear to be between materials with
similar appearances, such as classifying sand-like materials or determining fractions
and sizes of gravel and macadam.

There are decisions to be made about materials of interest and which materials
can be categorized into larger groups. The type of handled material is not expected
to change often. For instance, the predictions could be limited to
the expected set of available materials within a work site. Thus, the neural network
could be trained solely on a smaller set of materials of interest.
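Limiting predictions to the materials available on a site does not necessarily require retraining. As an alternative to training on a smaller set, the class scores of an existing classifier could be masked and renormalized. The sketch below illustrates this swapped-in idea; the function name, the scores and the site inventory are hypothetical, and the category names are borrowed from Table A.1.

```python
def restrict_to_site(class_scores, site_materials):
    """Keep only the scores of materials present on site and renormalize them."""
    kept = {m: s for m, s in class_scores.items() if m in site_materials}
    total = sum(kept.values())
    if total == 0:
        raise ValueError("no score mass on the materials available on site")
    return {m: s / total for m, s in kept.items()}


# Hypothetical classifier output over four material categories.
class_scores = {"Gravel": 0.40, "Macadam": 0.35, "Sand": 0.20, "LECA": 0.05}
# Hypothetical site where only gravel and sand are handled.
site_scores = restrict_to_site(class_scores, {"Gravel", "Sand"})
best = max(site_scores, key=site_scores.get)
print(best, site_scores)
```

In this made-up case the close competitor macadam is excluded, so the gravel prediction becomes considerably more confident.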

5.3 Hardware
The purpose of the developed system was to provide a prototype. As such, the
stereo camera is one choice out of many that can provide the necessary data for the
methodology. The camera is deemed necessary for material classification. However,
depth information can be gathered from a different sensor, which may provide higher
resolution and more accurate data. The stereo camera used is a reasonable choice,
since the methodology requires RGB images and depth data. It also provides built-
in algorithms for calculating depth, resulting in a compact and convenient assembly,
suitable for prototyping.

Together with the choice of sensor, placement is an important consideration. The
sensor has to be out of the way of moving machinery and surrounding objects.
The stated operating range of the stereo camera is potentially a limiting factor.
For the wheel loader, the distance between the sensor and the bucket exceeds the
recommended operating range set by Intel for accurate depth measurements. It
is unclear how this affects the system’s performance, and has to be considered in
further developments. The sensor placement for the wheel loader is difficult, due
to the bucket pose relative to the sensor. The optimal placement to not obstruct
the machinery would be on the cabin. However, this would result in an occluded
view into the bucket, limiting the amount of information that can be extracted with
the depth sensor considerably. The constructed extension that holds the camera is
pointing up and forward from the roof. This is not a practical implementation. It
both inhibits movement of the arm of the wheel loader and may interfere with the
surroundings.

Considering the excavator, the setup is not far from the scenario of the static test.
The camera can capture the majority of the bucket with few occlusions. However,
the bucket may be at different angles, and it is non-uniform compared to the static
test. This may provide additional challenges for the fill-factor network. Placement
on the stick is beneficial because the distance from the bucket to the camera remains
similar throughout all movements of the excavator. Nevertheless, the camera has to
be moved away from the boom-stick joint as there is a risk of crushing the sensor.
If it were placed on the cabin, the distance to the bucket would vary widely. The
consequence would be an additional challenge of handling variations in distances
between measurements.

5.4 The Complete Solution
The implemented Faster R-CNN manages the fill-factor estimation and bucket local-
ization jointly. Scales and ratios are tuned to match the sizes of objects of interest in
the images. The objects are of similar size throughout all images, making it possible
to tune the parameters to a small range. One inconvenience with the architecture is
the classification layer at the end of the network. Since the desired value is contin-
uous, a regression could be used in place of weighted sums. However, this requires
modification to the existing framework.

Intervals of fill-factors combined with weighted sums are deemed an appropriate
solution to retrieve a continuous prediction. The intervals predicted by the network
are determined based on the dataset. Tighter fill-factor intervals result in an increased
number of classes. This increases the complexity of the network and consequently
the amount of required data. However, a larger number of intervals potentially
provides a finer resolution of the continuous output. The Faster R-CNN architecture
is used for the developed system and no other variants were investigated. However,
the methodology is not limited to the chosen network architecture.
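One way to picture the weighted sum is as a score-weighted average of interval centers. The sketch below assumes that the network outputs one score per fill-factor interval and that each interval is represented by its center; the four intervals of width 0.25 and the scores are hypothetical and do not reflect the intervals used in the experiments.

```python
def continuous_fill_factor(class_scores, interval_centers):
    """Score-weighted average of interval centers, giving a continuous estimate."""
    total = sum(class_scores)
    weighted = sum(s * c for s, c in zip(class_scores, interval_centers))
    return weighted / total


# Hypothetical intervals [0, 0.25), [0.25, 0.5), [0.5, 0.75), [0.75, 1.0].
interval_centers = [0.125, 0.375, 0.625, 0.875]
# Hypothetical class scores for one detection.
class_scores = [0.0, 0.1, 0.6, 0.3]
estimate = continuous_fill_factor(class_scores, interval_centers)
print(estimate)
```

With these made-up scores the estimate lands between the centers of the two most likely intervals, which a plain arg-max over classes could not express.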

The reference data is obtained through on-board weighing systems, which have
measurement errors. The performance of the fill-factor estimation is directly influenced
by the quantity and quality of the reference data. The evaluation does not take into
account measurement errors of the on-board weighing system, since it is considered
as ground truth. Further errors are caused by the weight and fill-factor conversion,
as the density is assumed constant across measurements. The impact of this as-
sumption is not entirely evident, since there are variations in the compactness and
composition of the materials.
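The sensitivity to the density assumption can be illustrated with a small calculation. The sketch assumes that weight is the product of fill-factor, reference volume and density, which is a simplification that may differ from the conversion in Section 3.3. The 1.0 m³ reference volume and the fill-factor of 0.8 are hypothetical, while the gravel density interval of 1.2-1.5 t/m³ is taken from Table A.1.

```python
def weight_kg(fill_factor, reference_volume_m3, density_t_per_m3):
    """Weight under the assumed relation: fill-factor x volume x density."""
    return fill_factor * reference_volume_m3 * density_t_per_m3 * 1000.0


fill_factor = 0.8           # hypothetical network output
reference_volume_m3 = 1.0   # hypothetical bucket reference volume

# Density interval for gravel from Table A.1 [t/m3].
low = weight_kg(fill_factor, reference_volume_m3, 1.2)
high = weight_kg(fill_factor, reference_volume_m3, 1.5)
spread_percent = 100.0 * (high - low) / low
print(low, high, spread_percent)
```

Under these assumptions the same fill-factor maps to roughly 960 kg or 1200 kg depending on the density, a spread of about 25 %, which indicates why a constant density can contribute noticeably to the weight error.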

Using fill-factor, rather than weight or volume when training the network, has several
benefits. Firstly, all measurements become independent of what type of material is
loaded. The depth profile is similar regardless of what is in the bucket, given the
same volume of material. Secondly, different bucket sizes with a similar depth profile
can be evaluated with the same network. Thirdly, the reference data can be obtained
through either volume or weight, depending on availability.

No investigation regarding environmental conditions, such as light and weather, was
conducted. However, it is hypothesized that the conditions affect several parts of the
system. Firstly, the density is affected, which may cause further prediction errors.
For instance, the data acquisition for the excavator was conducted over several sessions.
Since the humidity in the gravel changes between the sessions, the assumption of
constant density across measurements fails. Secondly, the quality of the RGB image
is affected by light and weather, possibly obscuring it. This may result in an
incorrect material prediction. In contrast, fill-factor estimations are, in theory, more
robust against changes in the scene, provided that the sensor acquiring
depth data is suited for the conditions. Dust, dirt and similar could disturb the
stereo camera. The depth information is lost if one of the sensors in a stereo camera
is occluded. Furthermore, using a stereo camera for depth sensing is not possible
in low-light conditions, since insufficient details for aligning the image
pair are visible. Other types of sensors, such as a LIDAR or RADAR, may be more
suitable in that case.

5.5 System Improvements and Future Work
The results of the fill-factor estimation and material classification demonstrate that
the optical load weighing solution is viable. Nonetheless, there are improvements
available for the solution. The developed system is constructed with a limited
amount of information about the construction machine. Utilizing more information
could prove beneficial. For example, in the work by Rasul et al. [5],
presented in Section 1.2, sensor fusion of a stereo camera, a LIDAR and pressure
sensors in the machine was performed. In a similar fashion, sensor fusion between an
optical and an on-board weighing system can provide more accurate and robust pre-
dictions. Furthermore, pre-processing of depth data could take into account bucket
pose. The pose could either be estimated through an optical system or measured
by IMUs mounted on the machines. Additional information about the machine can
be provided by the operator, such as bucket dimensions and sensor placement, to
provide a more robust solution.

Being able to apply the trained network to various machines and bucket compositions
would be desirable. Large quantities of data are imperative for training the networks.
A suggestion is to continuously capture data with the sensor while operating the
machines within the industry. In that way, a diverse data set can be established
with variations in machines, buckets, environmental conditions and material types.
The reference data should ideally be in terms of volume to bypass the influence of
material properties when converting weight to fill-factor. Additionally, to achieve
generalization, the distance interval should be adjusted such that it reflects the
fill-factor levels of the setup used for training. A criterion for this to be valid is
that the bucket
profile has to be similar, as the sensor has no way to detect what is happening below
the material surface. The intervals used in the conducted experiments were tuned
manually by visual inspections of the depth images. Since the intervals should be
consistent for variations in machine models, bucket models and sensor placements,
a set of calibration steps would be required. This could be done either manually, by
examining reference depth data, or by bucket alignment based on point clouds.

Considering the material classification, the operator can be notified and prompted to
verify that the predicted material is correct. The network could take into account
that the type of the loaded material will not be altered frequently in a session.
Moreover, the material classification can consider location information, since it is
expected that there is a limited set of materials within a site. The location could
be tracked through a positioning system or provided by the operator. The combination
of material classifications and location information can be used for material tracking
purposes. Future work can also be dedicated to evaluating the network with
additional materials captured in various weather and light conditions. For certain
materials, such as macadam and gravel, there is a possibility to apply and evaluate
neural networks or threshold and segmentation algorithms for determining fractions.

When implementing the optical load weighing system in the machines, it can be
presumed that detection is performed in real-time. A decision has to be made on how
to predict the weights over a time series. For example, fusing multiple predictions
over time, or a moving average of the estimation could be used. Knowledge of the
pose of the bucket could be beneficial when capturing the data. This is to determine
when the bucket is in an optimal position for the field of view of the sensor.
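A moving average is one of the simplest ways to combine predictions over a time series. The following sketch keeps the most recent estimates in a fixed-size window; the class name, the window length of three frames and the weight stream are hypothetical.

```python
from collections import deque


class MovingAverage:
    """Average of the most recent `window` weight estimates."""

    def __init__(self, window):
        # A bounded deque discards the oldest value automatically.
        self.values = deque(maxlen=window)

    def update(self, estimate):
        """Add a new estimate and return the current smoothed value."""
        self.values.append(estimate)
        return sum(self.values) / len(self.values)


# Hypothetical per-frame weight estimates [kg] for one bucket load.
stream = [1000.0, 1100.0, 900.0, 1200.0]
smoother = MovingAverage(window=3)
smoothed = [smoother.update(w) for w in stream]
print(smoothed)
```

A bucket-pose check, as discussed above, could gate which frames are passed to `update` so that only frames with a favourable view contribute.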

The system could also be applied to other machines. The principle of determining
fill-factor and weight for a bucket or container is similar regardless of shape. For
instance, the proposed system could be applied to a truck or similar.
Lastly, the sensor is not required to be mounted on the machine of interest.
The optical load detection could be constructed as an external system, functioning
similarly to scales placed around work sites.

6 Conclusion

An optical load detection system is developed to measure loaded weight in the bucket
of two construction machines: an excavator and a wheel loader. The system relies
on depth and RGB data for estimating the type, location, volume and weight of
loaded material. The choice of hardware for acquiring the required data is a stereo
camera. It captures depth data and RGB images simultaneously. Depth images
are produced by filtering and mapping the depth data through a color map. For
fill-factor estimation, a Faster R-CNN architecture takes the depth image as input
and predicts fill-factors and a bounding box enclosing the material.

Three campaigns are conducted for three different setups. The first campaign uses
what is considered the optimal setup for capturing depth information from
the material. This campaign confirms the feasibility, and indicates the achievable
performance of the system. The other campaigns correspond to the excavator and
wheel loader. The sensor is placed on the stick on the excavator. This is deemed a
good solution to get a sufficient field of view of the bucket. The wheel loader sensor
placement requires consideration, as the current solution inhibits the movement of
the bellcrank.

Evaluation reveals that the system can estimate fill-factor with a mean absolute
percentage error relative to the maximum value (MAPEmax) of 3.3 % and 3.0 % for the
excavator and the wheel loader respectively. Furthermore, the predicted bounding
box is combined with the captured RGB image to filter out the background from
the image. It is then fed through an additional network that predicts the type of
material. The knowledge about material type is essential for providing density in-
formation to the weight calculation. The classifier is capable of identifying a range
of construction materials to high accuracy, only confusing similar material types.

Future work should be dedicated to generalizing the system. For example, being
able to use the system on additional machines, such as trucks and haulers, is a
possibility. Other sensors and network architectures should also be investigated.
Lastly, the system should be evaluated on a more diverse dataset, including various
bucket sizes and weather conditions for example.

To conclude, the investigation reveals that an optical load weighing system is feasible
given three criteria: a fill-factor estimation, a conversion between fill-factor and
reference volume and the density of the material.

Bibliography

[1] E. Rundgren, “Automatic volume estimation of timber from multi-view stereo
3d reconstruction,” M.S. thesis, Computer Vision Laboratory, Linköping Uni-
versity, Linköping, Sweden, 2017.

[2] P. Artaso and G. López-Nicolás, “Volume estimation of merchandise using
multiple range cameras,” Measurement, vol. 89, pp. 223–238, Jul. 2016.

[3] Y. Liang and J. Li, “Deep learning-based food calorie estimation method in
dietary assessment,” ArXiv, vol. abs/1706.04062, 2017.

[4] J. Lu, Z. Yao, Q. Bi, and X. Li, “A neural network–based approach for fill
factor estimation and bucket detection on construction vehicles,” Computer-
Aided Civil and Infrastructure Engineering, vol. 36, no. 12, pp. 1600–1618,
Dec. 2021. doi: 10.1111/mice.12675.

[5] A. Rasul, J. Seo, and A. Khajepour, “Development of integrative method-
ologies for effective excavation progress monitoring,” Sensors, vol. 21, no. 2,
p. 364, Jan. 2021. doi: 10.3390/s21020364.

[6] R. B. Fisher and K. Konolige, “Range sensors,” in Springer Handbook of
Robotics, B. Siciliano and O. Khatib, Eds. Berlin, Heidelberg: Springer Berlin
Heidelberg, 2008, pp. 521–542. doi: 10.1007/978-3-540-30301-5_23.

[7] P. Zanuttigh, G. Marin, C. D. Mutto, F. Dominio, L. Minto, and G. M. Corte-
lazzo, Time-of-flight and structured light depth cameras technology and appli-
cations. Springer International Publishing, 2018.

[8] M. Aboali, N. Abd Manap, A. Darsono, and Z. Yusof, “Review on three dimen-
sional (3-d) acquisition and range imaging techniques,” International Journal
of Applied Engineering Research, vol. 12, pp. 2409–2421, Jun. 2017.

[9] S. Zhang, “High-speed 3d shape measurement with structured light methods:
A review,” Optics and Lasers in Engineering, vol. 106, pp. 119–131, 2018. doi:
10.1016/j.optlaseng.2018.02.017.

[10] V. John, Q. Long, Y. Xu, Z. Liu, and S. Mita, “Sensor fusion and registration
of lidar and stereo camera without calibration objects,” IEICE Transactions
on Fundamentals of Electronics, Communications and Computer Sciences,
vol. E100.A, pp. 499–509, Feb. 2017. doi: 10.1587/transfun.E100.A.499.

[11] R. I. Hartley and A. Zisserman, Multiple view geometry in Computer Vision,
2nd ed. Cambridge University Press, 2004.

[12] J. Zhang, R. Du, and R. Gao, “Passive 3d reconstruction based on binocular
vision,” in International Conference on Graphic and Image Processing (ICGIP
2018), vol. 11069, May 2019, p. 124. doi: 10.1117/12.2524355.

[13] S. Mattoccia, “Stereo vision: Algorithms and applications,” University of Bologna,
vol. 22, 2013. [Online]. Available:
http://vision.deis.unibo.it/~smatt/Seminars/StereoVision.pdf (visited on 04/13/2022).

[14] D. Scharstein, R. Szeliski, and R. Zabih, “A taxonomy and evaluation of dense
two-frame stereo correspondence algorithms,” in Proceedings IEEE Workshop
on Stereo and Multi-Baseline Vision (SMBV 2001), 2001, pp. 131–140. doi:
10.1109/SMBV.2001.988771.

[15] K. Y. Kok and P. Rajendran, “A review on stereo vision algorithm: Challenges
and solutions,” ECTI Transactions on Computer and Information Technology
(ECTI-CIT), vol. 13, pp. 112–128, Nov. 2019. doi: 10.37936/ecti-cit.2019132.194324.

[16] S. Albawi, T. A. Mohammed, and S. Al-Zawi, “Understanding of a convolutional
neural network,” in 2017 International Conference on Engineering
and Technology (ICET), 2017, pp. 1–6. doi: 10.1109/ICEngTechnol.2017.8308186.

[17] K. O’Shea and R. Nash, “An introduction to convolutional neural networks,”
ArXiv, vol. abs/1511.08458, 2015.

[18] Z. Zou, Z. Shi, Y. Guo, and J. Ye, “Object detection in 20 years: A survey,”
ArXiv, vol. abs/1905.05055, 2019.

[19] M. Hussain, J. J. Bird, and D. R. Faria, “A study on cnn transfer learning for
image classification,” in Advances in Computational Intelligence Systems, A.
Lotfi, H. Bouchachia, A. Gegov, C. Langensiepen, and M. McGinnity, Eds.,
Cham: Springer International Publishing, 2019, pp. 191–202.

[20] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable are features
in deep neural networks?” In Proceedings of the 27th International Conference
on Neural Information Processing Systems - Volume 2, ser. NIPS’14, Montreal,
Canada: MIT Press, 2014, pp. 3320–3328.

[21] S. T. Krishna and H. K. Kalluri, “Deep learning and transfer learning ap-
proaches for image classification,” International Journal of Recent Technology
and Engineering (IJRTE), vol. 7, no. 5S4, pp. 427–432, 2019.

[22] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies
for accurate object detection and semantic segmentation,” Proceedings of the
IEEE Computer Society Conference on Computer Vision and Pattern Recog-
nition, Nov. 2013. doi: 10.1109/CVPR.2014.81.

[23] R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE International Conference
on Computer Vision (ICCV), Dec. 2015.

[24] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object
detection with region proposal networks,” in Advances in Neural Information
Processing Systems, C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R.
Garnett, Eds., vol. 28, Curran Associates, Inc., 2015.

[25] Intel, Intel realsense d400 series product family datasheet. [Online]. Available:
https://dev.intelrealsense.com/docs/intel-realsense-d400-series-product-family-datasheet
(visited on 02/28/2022).

[26] Intel, Depth camera d435i, 2021. [Online]. Available:
https://www.intelrealsense.com/depth-camera-d435i/ (visited on 02/28/2022).

[27] Intel, Intel realsense sdk 2.0. [Online]. Available:
https://www.intelrealsense.com/sdk-2/ (visited on 04/28/2022).

[28] Intel, D400 series visual presets. [Online]. Available:
https://dev.intelrealsense.com/docs/d400-series-visual-presets (visited on 02/28/2022).

[29] Colormaps in opencv. [Online]. Available:
https://docs.opencv.org/4.x/d3/d50/group__imgproc__colormap.html (visited on 02/28/2022).

A
Material Images, Densities and Full Confusion Matrix

(a) Gravel type 1, 0-16 mm
(b) Gravel type 2, 0-32 mm
(c) Gravel type 3, 8-11 mm
(d) Gravel type 4, 2-5 mm
(e) Gravel type 5, 8-16 mm
(f) Gravel type 6, 16-32 mm
(g) Gravel type 7, 8-16 mm
(h) Gravel type 9, 8-11 mm
(i) Barkdust type 1, 20-50 mm
(j) Sand type 1, 0-8 mm
(k) Sand type 2, 0-2 mm
(l) Sand type 3, 0-2 mm
(m) Cobblestone type 1, 100-250 mm
(n) LECA type 1, 12-20 mm
(o) Macadam type 1, 16-22 mm
(p) Macadam type 2, 8-16 mm

Figure A.1: The materials used for training and evaluating the material classification
network. The network input consists of patches extracted from the center of the
images.


Figure A.2: Gravel type 8 0-32 mm, used for training and evaluating the fill-factor
network.

Table A.1: Material categories and approximate density intervals (tons per cubic
meter).

Material category    Est. density [t/m³]
Macadam              1.3-1.4
Barkdust             0.5-0.6
Gravel               1.2-1.5
LECA                 0.3-0.4
Cobblestone          1.4-1.6
Sand                 1.2-1.3
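
As a minimal sketch of how the density intervals in Table A.1 could be turned into a load mass interval, the Python snippet below multiplies each interval bound by an estimated load volume. The function name `mass_interval_tonnes`, the 1.0 m³ bucket volume and the 0.8 fill factor are hypothetical and chosen only for illustration. The assumption that load volume equals fill factor times bucket volume is a simplification and not necessarily the exact model used in this thesis.

```python
# Approximate density intervals from Table A.1, in tonnes per cubic metre.
DENSITY_T_PER_M3 = {
    "Macadam": (1.3, 1.4),
    "Barkdust": (0.5, 0.6),
    "Gravel": (1.2, 1.5),
    "LECA": (0.3, 0.4),
    "Cobblestone": (1.4, 1.6),
    "Sand": (1.2, 1.3),
}


def mass_interval_tonnes(category, bucket_volume_m3, fill_factor):
    """Return the (low, high) load mass in tonnes for one bucket load.

    Assumption (illustrative only): load volume = fill_factor * bucket volume.
    """
    if bucket_volume_m3 < 0 or fill_factor < 0:
        raise ValueError("bucket volume and fill factor must be non-negative")
    low_density, high_density = DENSITY_T_PER_M3[category]
    load_volume_m3 = fill_factor * bucket_volume_m3
    return low_density * load_volume_m3, high_density * load_volume_m3


if __name__ == "__main__":
    # Hypothetical example: a 1.0 m^3 bucket filled to 80 % with gravel.
    low, high = mass_interval_tonnes("Gravel", 1.0, 0.8)
    print(f"Gravel: {low:.2f} to {high:.2f} t")
```

Because the table gives intervals rather than point values, the sketch returns an interval as well. The width of that interval reflects the density uncertainty within a single material category.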


Figure A.3: Confusion matrix including all individual material types. Class names
indicate the type of material and the fineness. For instance "Gravel Type 1 0-16"
refers to one type of gravel in the dataset with grain sizes in the interval 0 mm to
16 mm.
