AI-enhanced Algorithm for Structural Health Monitoring
An Image-based Concrete Crack Detection Method Using Convolutional Neural Networks

Master's thesis in Structural Engineering and Building Technology

XI LUO
JIA GUO

Department of Architecture and Civil Engineering
Division of Structural Engineering
Concrete Structures Research Group
Chalmers University of Technology
Gothenburg, Sweden 2021
www.chalmers.se

© XI LUO, JIA GUO, 2021.

Supervisor: Kamyab Zandi, Department of Architecture and Civil Engineering
Examiner: Kamyab Zandi, Department of Architecture and Civil Engineering

Master's Thesis 2021
Department of Architecture and Civil Engineering
Division of Structural Engineering
Concrete Structures Research Group
Chalmers University of Technology
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Cover: A convolutional neural network architecture for image segmentation.
Typeset in LaTeX, template by Magnus Gustaver
Printed by Chalmers Reproservice
Gothenburg, Sweden 2021

Abstract

Image-based damage detection methods for structural health monitoring, especially the application of UAVs for structural inspection, have been thriving in recent years. The concept of Digital Twin is to build a living digital representation of a structure, which requires a fast data processing procedure. This thesis proposes a CNN-based crack detection method that can recognize and extract cracks from photos of concrete structures, which can enhance the data processing for a Digital Twin. The algorithm consists of two subsequent procedures, classification and segmentation, achieved by two convolutional neural networks respectively. First, full images are divided into patches and classified as positive or negative. Then, those sub-images classified as positive, where cracks are visible, are further processed by the image segmentation procedure to obtain the pixel-level shapes of the cracks. For the classification part, the performance of transfer learning models based on pre-trained VGG16, Inception V3, MobileNet and DenseNet169 is compared with different classifiers. Finally, the CNN based on MobileNet was trained with 30,000 training images and reaches 97% testing accuracy and a 0.96 F1 score on the test images. For the segmentation part, different neural networks based on the elegant U-net architecture are built and tested. The models are trained with 3840 crack images with annotated ground truth, and compared quantitatively and qualitatively. The model with the best performance reaches 88% sensitivity on the test data set. The combination of the classification and segmentation neural networks achieves an image-based crack detection method with high efficiency and accuracy. The algorithm can process full images of any size as input.
Compared with most machine-learning-based crack detection algorithms using sub-image classification, a relatively larger patch size is used in this thesis, which makes the classification more robust and accurate. Moreover, the negative areas of the full image are excluded from the segmentation procedure, which not only saves considerable computational power but also significantly increases the accuracy compared to segmentation performed on full images.

Keywords: Digital Twin, Crack Detection, Convolutional Neural Network, Computer Vision, Image Classification, Semantic Segmentation.

Acknowledgements

It was a challenge for both of us to conduct a master's thesis related to the topic of computer vision and machine learning, but we are grateful that we got this chance to learn and practice in the most advanced field of structural health monitoring. We want to express our most sincere appreciation to Kamyab Zandi, who has been supervising our project for the past half year and providing us advice to help us carry out this study in the right way. We are grateful to Brosamverkan for the financial support, which gave us the chance to acquire a drone and test our method. Our colleague Henrik Waldäng collected photos of damaged concrete structures, which are important source data for our neural networks, and we appreciate his contribution as well. Chalmers is an amazing place and both of us enjoyed the past two years of our master's program. We have met interesting colleagues and teachers and made new friends. Life in the Nordic countries is totally different, and the Covid-19 pandemic has made it more difficult. Even so, this experience will remain one of the best memories of our lives. Thank you, Chalmers.

Xi Luo and Jia Guo, Gothenburg, June 2021

Contents

1 Introduction
  1.1 Background and Problem Description
  1.2 Goals and Objectives
  1.3 Scope and Limitation
  1.4 Literature Review
  1.5 Approach and Outline
2 Convolutional Neural Network for Structural Health Monitoring
  2.1 Application of CNN in the SHM field
  2.2 The development of CNN
    2.2.1 The architecture of CNN
    2.2.2 The hyper-parameters
  2.3 Improving the network performance
    2.3.1 Data Augmentation
    2.3.2 Pre-processing
    2.3.3 Batch normalization
    2.3.4 Regularization
    2.3.5 Dropout
3 Image Classification
  3.1 Experiment Set-up
    3.1.1 Image data set preparation
    3.1.2 The convolutional neural network
    3.1.3 Hardware and software environment
  3.2 Implementation
    3.2.1 Patch size
    3.2.2 Parametric study
    3.2.3 The architecture modification
    3.2.4 The data imbalance
  3.3 Transfer Learning
    3.3.1 The backbone of the feature extractor
    3.3.2 The classifier
  3.4 Results
  3.5 Conclusion
4 Image Segmentation
  4.1 U-net architecture
  4.2 Data-set
  4.3 Unbalanced segmentation and loss functions
  4.4 Evaluation Criteria
  4.5 Training and Results
  4.6 Summary
5 Experiment
6 Conclusion and prospect
  6.1 Conclusion
  6.2 Prospect
Bibliography
1 Introduction

1.1 Background and Problem Description

Structural Health Monitoring (SHM) refers to the discipline of regular inspection or real-time monitoring of civil structures [1]. Information about the structure is collected by human inspection or by multiple kinds of sensors, based on which the safety state of the structure is assessed and potential risks can be recognized at an early stage. In this way, large costs can be saved and, more importantly, public safety can be ensured. The concept of Digital Twin (DT), first proposed by NASA in 2010, serves as a digital representative of a physical object.
It is a multi-physics and multi-scale simulation that reflects the real-time state of the corresponding object, based on both historical data and real-time sensor data [2]. In the context of Structural Health Monitoring, a Digital Twin serves as a living digital model of the real structure, providing structural performance predictions by processing various kinds of data collected by sensors [3].

Figure 1.1 illustrates the main procedures of a Digital Twin for SHM. First, information about the structure is collected by devices such as cameras or embedded sensors. Nowadays, the thriving of Unmanned Aerial Vehicles (UAV) has provided a more flexible method to inspect civil structures. These data, in the form of digital images or signals for example, are processed, and structural damage is recognized and located automatically. Finally, the digital representative of the structure is updated with the detected damage and the structural behaviour is analyzed.

Figure 1.1: Schematic diagram of Digital Twin for SHM

The application of DT in the SHM field provides an efficient and low-cost strategy. However, since it integrates both time and space information, DT demands high-level information analytics and computational power to support its up-to-date simulation. Nowadays, with the fast development of information transmission technologies such as 5G networks and cloud computing, the concept of DT provides a brand new possibility for SHM and has aroused attention in academia [4].

For concrete structures, cracks are typical signs of performance deterioration. Detection of cracks is therefore an important part of SHM for concrete structures. Traditional crack detection is usually done manually. Nowadays, non-contact sensors such as high-speed cameras and Unmanned Aerial Vehicles (UAV) are widely used in the SHM field, and crack detection based on image data has been proved to be a more efficient option [5]. Strategies such as image processing methods and machine learning algorithms have been developed for this purpose, among which the Convolutional Neural Network (CNN) has recently become a popular class of Deep Learning algorithms for image recognition problems [6].

1.2 Goals and Objectives

This thesis investigates the second procedure of the Digital Twin work flow shown in Figure 1.1, and the topic is the application of Artificial Intelligence algorithms in the SHM field. Specifically, the aim is to establish a well-functioning image-based crack detection method using CNN models, which is expected to recognize cracks in an image and locate damage on the structure surface. Moreover, the algorithm should also be robust to noisy data collected from rough concrete surfaces, to reduce the false identification rate. The outcome of this algorithm will be part of the creation of a Digital Twin of a concrete structure, providing practical information that can be used in the simulation of structural performance. Many further applications can also be built on this highly autonomous crack detection procedure. In the long term, this project can contribute to the promotion of non-contact sensors and Digital Twins in the SHM field. Regarding the societal effect, these advanced technologies are expected to provide a more cost-efficient and sustainable way to perform structural maintenance, which eventually benefits both industry and ecology.
1.3 Scope and Limitation

This thesis focuses on the topic of concrete structural health monitoring, and the proposed algorithm is designed for the specific task of crack detection on concrete structure surfaces. Raw image data collected from both concrete structure specimens and real concrete structural elements are used for training and testing of the proposed networks. For data-driven algorithms such as CNN, the database plays an important role in the network training and testing procedure. The raw data collection is beyond the scope of this thesis, and existing databases established previously in other projects are used.

The proposed image-based crack detection algorithm is supposed to serve as part of the Digital Twin program, and the output of the algorithm will be reflected in the digital twin simulation of the physical concrete structure. The function of this algorithm is limited to identifying and locating cracks in an image of a concrete surface. Absolute accuracy is hard to achieve, and reported applications of CNN in the SHM community have never reached 100% accuracy either [6]. The proposed models in this thesis aim at a faster and fully autonomous algorithm with an acceptable robustness and a better applicability in the field inspection setting.

1.4 Literature Review

In the SHM field, image-based crack detection using CNN algorithms has been proved to be a feasible method [6]. A literature review was conducted first to study the advanced techniques for image-based crack detection using CNN.

Many studies have treated image-based crack detection as a sub-image classification problem. Zhang et al. [7] proposed a ConvNet model trained with more than 500 pavement images, which can recognize the presence of road cracks in a square image patch sized 99×99 pixels. Using a similar strategy, Kim et al. [8] proposed a transfer learning network based on a pre-trained R-CNN to detect cracks on concrete bridges. The Sobel edge detection algorithm was also used in that study to quantitatively measure cracks. Inspired by the famous CNN model AlexNet for image classification, Kim and Cho established an algorithm that can classify sub-images sized 227×227 into four classes (cracks, structural joints, plants and intact surface). The multi-class model and the large window size significantly increased the identification accuracy, and by using an overlapping window the region of cracks is narrowed. All the studies mentioned above used the sliding window method, which first divides a full image of a concrete surface into sub-images and performs image classification on the sub-images. The sub-images with cracks are thus selected and the area of damage is obtained. But this method has a resolution at the sub-image level and provides no quantitative information about the cracks, such as width and length. Further, when a small sub-image size is used, meaning the amount of input data is smaller, the accuracy and robustness of CNN classifiers also decrease.

Another direction of image-based crack detection is image segmentation, which can reach a pixel-level resolution. Lin et al. [9] proposed a CNN model based on the famous U-net architecture to perform full-image segmentation. The attention gate technique is employed in the model to improve the performance on small object detection tasks. Zhang et al. [10] investigated different architectures based on U-net and compared their performance on 256×256 sub-images, and generalized loss functions are used in the training of the models.
The image segmentation strategy can reach a better resolution of the target cracks than sub-image classification, but it also costs considerably more computational power to finish the task, especially for full-image processing. Another problem is that cracks are usually extremely small objects in a full image of a concrete surface. For this type of small-target segmentation, the training of CNN models becomes unstable.

1.5 Approach and Outline

This thesis proposes a combination of image classification and segmentation CNN models to achieve an algorithm with high accuracy and efficiency. First, full images are divided into patches and classified as positive or negative. Then, those sub-images classified as positive, where cracks are visible, are further processed by the image segmentation procedure to obtain the pixel-level shapes of the cracks. The combined algorithm has advantages in many aspects. Firstly, it can process full images of any size as input, which means there is no limitation on the data collection devices. Secondly, compared with most machine-learning-based crack detection algorithms using sub-image classification, a relatively larger patch size can be used, making the classification more robust and accurate. Lastly, the negative areas of the full image are not considered in the segmentation procedure, which not only saves a lot of computational power but also fixes the small-target segmentation issue and significantly increases the accuracy compared to segmentation performed on full images.

This thesis consists of six chapters:

1. Introduction: A brief overview of this thesis, including the background of the SHM and Digital Twin concepts. It introduces the problem to be solved, the objectives, scope, relevant research and the approach of this master's thesis project.

2. Convolutional Neural Network for Structural Health Monitoring: Presents the basic theory of the CNN algorithm and its applications in the SHM field. Several advanced techniques that have been developed to optimize CNN algorithms for computer vision problems are also introduced in this chapter.

3. Image Classification: This chapter records the hyper-parameter tuning process and the techniques used to obtain a better accuracy. Based on the final prediction results, the best model is chosen and used in the algorithm.

4. Image Segmentation: CNN models based on the elegant U-net architecture for image segmentation tasks are built and trained with pixel-level annotated crack sub-images. The models are evaluated quantitatively and qualitatively.

5. Experiment: Photos of cracks were taken by a drone on the Chalmers campus, of a damaged wall in the structures lab and of a concrete specimen. The photos are used to test the proposed algorithm, and the crack detection results are illustrated in this chapter.

6. Conclusion: Presents a discussion of the performance of the proposed models and a summary of this thesis, including the conclusions of this study and possible improvements. The possible applications of the outcome and the prospects for future study are also discussed.

2 Convolutional Neural Network for Structural Health Monitoring

2.1 Application of CNN in the SHM field

Cracks on the concrete surface are an early sign of concrete degradation, as they can accelerate the corrosion process and propagate into more severe concrete defects at a later stage.
Therefore, early crack detection has always been an essential task in structural health monitoring and a routine observation item in inspections. Nowadays, robots or unmanned aerial vehicles can partly take over the data collection process, leaving only the image-based crack detection work for the inspector. However, manually selecting the cracks in concrete images is very time-consuming and labor-intensive, a problem that is exacerbated when the image database becomes large. Image enhancement techniques can highlight the cracked parts to ease the crack selection, and thus often show up as a post-processing step in image processing pipelines. The traditional image processing technique, known as the automated crack detection process based on various edge detection algorithms, can identify the cracked parts within concrete images without human intervention; however, the processed results still need a final check. Unlike the traditional image processing technique, the deep learning method can predict the results without review, given some labelled data for training, which has made it the current trend.

The image processing technique (IPT) in the crack detection field generally denotes an automated crack prediction process performed purely by a mathematical algorithm. The flowchart in Figure 2.1 below shows the pipeline. Various edge detectors work well in the edge detection step when the concrete image does not contain too much noise. The next step is to apply some image enhancement techniques to increase the clarity of the edge features. The final step is to extract the cracked features by image segmentation techniques such as binarization or thresholding.

Figure 2.1: The pipeline of IPT

A study by Abdel-Qader et al. [11], who systematically compared the fast Haar transform, Fourier transform, Sobel filter and Canny filter, concluded that the fast Haar transform has the top performance with 86% accuracy, followed by the Canny filter. Subsequent studies involved different region-based crack detection techniques, such as the support vector machine and the percolation model developed in 2008 by Yamaguchi et al. [12]. Nishikawa et al. [13] found a way to quantify the crack width from the luminance distribution in an image in 2012. However, the traditional IPTs have the common drawback of lacking resilience to image noise such as varying clarity and lighting conditions. LeCun et al. [14], by applying the convolutional neural network to image classification, brought an innovative way of crossing the hurdle where traditional IPTs remain. Dorafshan et al. [15] conducted a study in 2018 comparing four edge detection methods in the spatial domain (Roberts, Prewitt, Sobel, and Laplacian of Gaussian) and two in the frequency domain (Butterworth and Gaussian), along with an AlexNet-based DCNN in different training conditions. The authors claim that the DCNN in the transfer learning mode performs better than the IPTs, with higher sensitivity for thinner cracks (from 0.1 mm down to 0.04 mm), accuracy improved from 53-79% to 86%, and a shorter computational time for the prediction process itself. By proposing a hybrid detection model at the end of the paper, the study revealed the promising future application of CNN in the crack detection field.
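As an illustration, here is a minimal sketch of the IPT pipeline described above (edge detection, enhancement, binarization), using OpenCV, the cv2 package later listed in Table 3.1. The file names and threshold values are hypothetical and not taken from this study:

```python
import cv2
import numpy as np

# Hypothetical input file; any grayscale photo of a concrete surface works.
img = cv2.imread("concrete_surface.jpg", cv2.IMREAD_GRAYSCALE)
img = cv2.GaussianBlur(img, (5, 5), 0)       # suppress surface noise
edges = cv2.Canny(img, 50, 150)              # edge detection step
kernel = np.ones((3, 3), np.uint8)
enhanced = cv2.dilate(edges, kernel)         # enhancement: thicken thin edges
_, mask = cv2.threshold(enhanced, 127, 255, cv2.THRESH_BINARY)  # binarization
cv2.imwrite("crack_mask.png", mask)
```

As the review above notes, such a pipeline is sensitive to lighting and image clarity, which is precisely the weakness the CNN-based methods of the following sections address.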
2.2 The development of CNN

The history of convolutional neural network (CNN) development has three different stages. The first stage is the invention of CNN. Inspired by biological experiments on the visual cortex cells of cats, Hubel and Wiesel developed the concept of the receptive field to describe the hierarchical transmission of stimuli in the visual system in the 1960s [16]. This biological observation begot the incipient CNN, called the Neocognitron, coined by Fukushima in the 1980s [17]. As a hierarchical network like the biological cortex system, the Neocognitron has the property that, once trained, the recognition process is not affected by slight deformations or spatial shifts of the input data (Figure 2.2). The novelty of this research is the innovative imitation of how information passes between cells or neurons. At that time, unsupervised learning was the method of choice. Although the Neocognitron managed to train the higher layers forward, this learning process strongly depends on the input training patterns, which became the main obstacle to scaling up when the recognition task becomes complex. LeCun et al. [18] applied a backpropagation algorithm to improve on the deficiencies of the previous work. They trained the network known as LeNet-5, using convolutional filters to extract the features of the original image and propagating them forward, with outstanding performance. LeNet-5's success in recognizing handwritten digits aroused the attention of academia. Despite the improved training algorithm, problems of deep networks such as vanishing gradients and overfitting remained challenges to be resolved.

Figure 2.2: The synaptic connections between cells, and the response of the cells after completion of the self-organization

The fast growth of artificial neural network techniques over the past decades has facilitated CNN development. For instance, Glorot and Bengio proposed the normalized initialization scheme to counter the vanishing gradient issue [19]. In the image classification contest held in 2012, the champion Krizhevsky proposed the well-known AlexNet [20], which achieved a test error rate of 15.3%, almost half that of the runner-up, and brought further heat to the field. After AlexNet, the CNN research field was blooming. For instance, the development of VGG-Net by the Visual Geometry Group and GoogLeNet from Google drove CNNs to become more and more sophisticated for industrial applications.

2.2.1 The architecture of CNN

2.2.1.1 The input layer

The input layer represents the pixel values of the images, often as a multidimensional tensor. If the network processes RGB images, which have three channels, the input layer is a tensor with four dimensions. Generally, the first dimension indicates the image number, and the remaining dimensions are the image height, image width and image depth. After the input layer, the image is passed through the rest of the convolutional neural network.

Figure 2.3: The 3-dimensional tensor for an RGB image
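As a toy illustration of this layout, assuming TensorFlow (the framework used later in this thesis), the 4-D input tensor for a batch of RGB images can be inspected directly; the batch size and image size here are arbitrary:

```python
import tensorflow as tf

# A batch of 8 RGB images of size 128x128: (batch, height, width, channels)
batch = tf.random.uniform((8, 128, 128, 3))
print(batch.shape)  # (8, 128, 128, 3)
```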
2.2.1.2 The convolutional layer

The convolutional layer plays the most pivotal part in the overall network, convolving its input and passing the result on to the next layer. The convolutional filters, which have learnable weights for detecting the features that represent the original image, perform the actual convolution operation. The hyper-parameters also include the stride and the padding size. The stride is the number of pixels the filter leaps each time; the padding size is the number of complementary pixels added in a specified way, for purposes such as controlling the output size. In most situations, a non-linear activation function follows the convolutional layer to empower the network to learn non-linear characteristics.

Figure 2.4: Visualization of the convolution process with a 2×2 filter

2.2.1.3 The pooling layer

The pooling layer performs down-sampling to reduce the output size. A pooling layer that only keeps the maximum value among the pixels is called a max-pooling layer, whereas an average-pooling layer keeps the average pixel value. In most cases, the pooling layer helps to ease the data flow and prevent overfitting.

Figure 2.5: Visualization of the max-pooling process with a 2×2 filter and stride 1

2.2.1.4 The fully connected layer

The fully connected layer is a typical multi-layer perceptron neural network that connects the flattened convolutional output to the final output classes. Generally, the last fully connected layer is the output layer that makes the final prediction: the probability of each target class, or the actual predicted value.

2.2.2 The hyper-parameters

2.2.2.1 The loss function

The loss function, also known as the cost function, plays a vital role in the training process, since the overall training goal is to reduce it. The loss is the key value for updating the learnable parameters of the network, such as the weights and biases. In practice, customized loss functions are often used for the task at hand. The most commonly used loss functions are listed below, grouped by three task categories.

1. The Regression Loss

• The Mean Squared Loss
The mean squared loss, also known as the L2 loss, is calculated as the squared error between the true and the predicted value. This loss function always gives a positive value, where zero means a perfect prediction.

MSE = (y − f(x))²   (2.1)

• The Mean Absolute Loss
The mean absolute loss is the distance between the actual value and the predicted value. Known as the L1 loss, it is more robust against outliers compared with the L2 loss.

MAE = |y − f(x)|   (2.2)

• The Huber Loss
The Huber loss is a piecewise function that synthesizes the benefits of both MSE and MAE. Thresholding the error by a factor δ, errors smaller than δ are penalized as MSE, otherwise as MAE:

L_δ = ½(y − f(x))²        if |y − f(x)| ≤ δ
      δ|y − f(x)| − ½δ²    otherwise            (2.3)

2. The Binary Classification Loss Function

• The Binary Cross-Entropy Loss
The term entropy describes the uncertainty of a system or state, where higher entropy usually indicates higher randomness:

S = −∫ p(x) log p(x) dx     if x is continuous
    −Σ_x p(x) log p(x)      if x is discrete     (2.4)

If the prediction is binomial, the entropy function turns into:

L = −y log(p) − (1 − y) log(1 − p)   (2.5)

If the two outcomes are also mutually exclusive, meaning y can only be 0 or 1, this yields the definition of the binary cross-entropy:

L = −log(1 − p)   if y = 0
    −log(p)       if y = 1    (2.6)

• The Hinge Loss
The hinge loss is suitable for the support vector machine, where the label y can only be −1 or 1, so that a prediction with low confidence of being the ground truth is also penalized:

L = max(0, 1 − y · f(x))   (2.7)
3. The Multi-class Classification Loss Function

• The Multi-Class Cross-Entropy Loss
When the classification has multiple targets, the cross-entropy loss needs to be expanded. With a one-hot labelled vector as the target y_i, one can write the loss function as:

L(x_i, y_i) = −Σ_{j=1}^{c} y_ij log p_ij   (2.8)

where y_i is a one-hot encoded vector whose element y_ij equals 1 only when the i-th sample belongs to class j, and is zero otherwise:

y_i = (y_i1, y_i2, ..., y_ic)   (2.9)

• The KL-Divergence
The KL-divergence is a mathematical expression for evaluating the difference between two distributions. The KL-divergence equals zero when the two distributions are identical:

D_KL(P‖Q) = ∫ p(x) log (p(x)/q(x)) dx    if the distribution is continuous
            Σ_x p(x) log (p(x)/q(x))     if the distribution is discrete     (2.10)
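To make the two cross-entropy variants concrete, here is a small sketch using their Keras implementations; the probability values are made up for illustration:

```python
import tensorflow as tf

# Binary cross-entropy, Eq. (2.6): -log(p) for y=1, -log(1-p) for y=0
y_true = tf.constant([[1.0], [0.0]])
y_pred = tf.constant([[0.9], [0.2]])
bce = tf.keras.losses.BinaryCrossentropy()
print(bce(y_true, y_pred).numpy())  # mean of -log(0.9) and -log(1 - 0.2)

# Multi-class cross-entropy, Eq. (2.8), with a one-hot target vector
y_true_mc = tf.constant([[0.0, 1.0, 0.0]])
y_pred_mc = tf.constant([[0.1, 0.8, 0.1]])
cce = tf.keras.losses.CategoricalCrossentropy()
print(cce(y_true_mc, y_pred_mc).numpy())  # -log(0.8)
```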
st = γst−1 + (1− γ)gt ⊙ gt (2.17) w′ij = η wij − √ gt (2.18) st + ϵ Where γ is the hyper-parameter satisfying 0 ≤ γ1 ≤ 1. Further substitution done by Adadelta optimizer was replacing the constant learning rate η to the dynamic factor △x associated with weights: st = γ1st−1 +√(1− γ1)gt ⊙ gt (2.19) △xt−1 + ϵ w′ij = wij − gt (2.20)st + ϵ △xt = γ2△xt−1 + (1− γ2)gt ⊙ gt (2.21) where: The factor ϵ aims to stabilize the calculation. In most cases is very small such as 10−6. The Adam optimizer kept the st and △xt. once the hyper-parameter γis fixed such as 0.9 and both x and s have zero initialization, those two factors will have a small value in the beginning as the only contributing term will multiply by 0.1. In order to ease the effect, Adam optimizer has re-normalized the st and △xt: = stŝt (2.22)1− γ1t 12 2. Convolutional Neural Network for Structure Health Monitoring △̂xt = △xt (2.23) 1− γ2t ′ = − √η△x̂twij wij gt (2.24)ŝt + ϵ 2.2.2.3 The activation function 1. The sigmoid function The sigmoid function is the mathematical function that has an "S" shape. For example, it can be the logistic function, the hyperbolic function, the Arc- tangent function, etc. It appears to be the first generation of activation func- tion in deep learning. The S shape endows the sigmoid function relatively high speed of changing in the middle section comparing to the two polar extremes. • Logistic function Logistic function takes the formula σ(x) = 11+ −x and output the value ate [0,1], it has been pervasively used since it can concentrate the large real output to the interval. However, the shape of the sigmoid function has saturation at two polar extremes, which will slow the weight updating process down once the output after the layer gets small, then produces the vanishing gradient problem. • Hyperbolic tangent The hyperbolic tangent function has a very high resemblance with the logistic function but confines output to [-1,1] and centralizes the result to 0. It takes form 2xf(x) = e −1e2x+1 . 2. Rectified Linear Unit (ReLU) The rectified linear unit function can solve the vanishing gradient problem. Its activated output range is [0,1]. The mathematical form for ReLu activation has the form: σ(x) = max(0, x) (2.25) A lot of research has pointed out that the ReLU function can improve training effectiveness. However, the network will not pronounce the negative output, which might cause training insufficiency as well.[21] 13 2. Convolutional Neural Network for Structure Health Monitoring The activation function 6 4 2 0 Logistic function Hyperbolic tangent ReLu −6 −4 −2 0 2 4 6 x Figure 2.6: The activation function 2.3 Improve the network performance 2.3.1 Data Augmentation Data augmentation is one data pre-processing technique that can enlarge the image database. Common was seen as flip, rotate, crop the original image. Data augmen- tation has proven benefits for improving performance. For example, it can generalize the network adaptability and prevent over-fitting. 2.3.2 Pre-processing Data pre-processing includes re-scaling, standardization, normalization, etc. It can help the network learn effectively. 2.3.2.1 Re-scaling/Normalization In general, the RGB image has 8 bits, which makes the pixel value range from 0 to 255. Large average and variation will decrease the training efficiency. Therefore, re-scaling the image to 0-1 is the first data prepossessing step for preparing the training data. = X −XminX (2.26) Xmax −Xmin Where X is the pixel value in image. 
If one only applies the normalization on the original image, since Xmax = 255 and Xmin = 0. Therefore, the formula has the form: X = X/255.0 (2.27) 14 σ(x) 2. Convolutional Neural Network for Structure Health Monitoring 2.3.2.2 Standardization/Image whiting The standardization process makes the standard deviation to 1 and the means value to 0, also known as the image whiting technique. X = X − µ (2.28) σ 2.3.3 Batch normalization The batch normalization layer appears before non-linearity activation in general cases. The purpose of batch normalization is to accelerate the training process to reach the global minimum of loss. By adding the batch normalization layer, the mean value and standard variation can be 0 and 1, respectively. The underpinning of the optimizing effect has many explanations. The most widely accepted one was reducing the internal con-variance shift, which means the neg- ative impact of continuous weights updating proceeding between layers.[22] How- ever, the study done by Shibani Santurkar et al.[23] has pointed out the posi- tive effect attributed to the smoothing loss landscape during the training process mostly. The "smoothing effect" is favorable for the non-convex optimizing process by making the loss decreasing slowly with smaller weights and providing better "β-smoothness"[24]to the gradients of the loss. The author also pointed out, this smoothing effect does not bond to the batch normalization layer. It can get from other regularization methods too. [23] 2.3.4 Regularization Regularization is the method used in deep learning to prevent over-fitting. It gives the modified loss function an additional penalty term(also be known as regular- ization term) so that the weight can be limited in a small range and leads to an uncomplicated model with a lower risk of over-fit training. The penalty term often has two type: L1 regularization: α L̂(w) = L(w) + ∗ ∥w∥ (2.29) 2 ∇wL̂(w) = αsign(w) +∇wL(w) (2.30) wnew = wold − ϵ(αsign(wold) +∇wL(wold)) (2.31) L2 regularization: α L̂(w) = L(w) + ∗ ∥w∥2 (2.32) 2 ∇wL̂(w) = αw +∇wL(w) (2.33) wnew = wold − ϵ(αwold +∇wL(wold)) (2.34) = (1− ϵα)wold − ϵ∇wL(wold) Where α is the penalty factor and ϵ is the learning rate. 15 2. Convolutional Neural Network for Structure Health Monitoring The L1 and L2 regularization subtract the absolute value of the weights matrix and the squared sum of the weight matrix, respectively. Therefore, L1 regularization can give a more sparse solution since the weights can be zero with a rate of -1 or 1. 2.3.5 Dropout Dropout is a strategy for preventing over-fitting when the network becomes deep and complicated. It makes the network drop some neurons during each training iteration randomly with predefined probability p, which will force the remaining neurons to learn more effectively. However, one thing worth noticing is that due to the randomness of the turn-off neuron, it is unlikely to get the same inference prediction results. Figure 2.7: Visualization of drop out 16 3 Image Classification This chapter aims at developing a proper convolutional neural network that can classify the input image accurately. In order to get the model with the best perfor- mance, multiple trials executed by altering different influential factors for the CNN will show up in the first section of this chapter. After this section, the best perfor- mance model will be compared with the transfer learning model which connects with different classifiers. 
3 Image Classification

This chapter aims at developing a proper convolutional neural network that can classify the input images accurately. In order to obtain the model with the best performance, multiple trials, executed by altering different influential factors of the CNN, are presented in the first section of this chapter. After this section, the best-performing model is compared with transfer learning models connected to different classifiers. Finally, the prediction results from both the transfer learning model and the fully-trained model are presented in the results section.

3.1 Experiment Set-up

3.1.1 Image data set preparation

The data set used in this project consists of 77 camera-taken RGB images. It contains various backgrounds, including concrete scratches, stains, occlusions, wall edges and ambient infrastructure objects such as grass and stones. The dimension of each image is [1836, 3264, 3].

Figure 3.1: Image example from the database

As an image processing technique, data augmentation can enlarge the database when the image data are limited, prevent over-fitting and favour the network's generalization ability. Rotation by different angles, shearing, random brightness adjustment and added Gaussian noise were applied in the experiment; which augmentation method to apply was chosen at random.

Figure 3.2: Data augmentation

3.1.2 The convolutional neural network

The convolutional neural network used in the experiment has the architecture below:

Figure 3.3: The convolutional neural network

3.1.3 Hardware and software environment

The models are trained on the Google Colab platform. The specification of the hardware and software environment is given in Table 3.1.

Table 3.1: Specification of hardware and software environment

Platform   Google Colab
CPU        Intel(R) Xeon(R) CPU @ 2.20GHz
GPU        NVIDIA Tesla P100-PCIE (16GB)
Python     3.7.10
Packages   TensorFlow 2.4.1; OpenCV2 4.1.2; Numpy 1.19.5; Matplotlib 3.2.2

3.2 Implementation

3.2.1 Patch size

The sub-image size has a substantial effect on the network performance, as it defines how many pixels one image subject will contain. In this project, crack width generally propagates within the range of 8-16 pixels, but there is also a relatively high number of cracks that take up more than 32 pixels. The sub-image size alters the relative size between the crack feature and the uncracked background, which may end up being reflected in the final classification results. Therefore, an investigation should be carried out to set an appropriate sub-image size. In order to perform a single-variable experiment, all parameters except the patch size (64, 128 and 256, respectively) are kept the same. The total number of sub-images for each group is 8K, of which half are cracked images and the other half uncracked. The train-validation ratio is kept at 4:1, and one tenth of the total images is used for the final evaluation.

Table 3.2: The mutual hyper-parameters for training

Optimizer   Learning rate   Epochs   Batch size
Adam        1e-05           1000     200

Table 3.3: Comparison between different patch sizes

Patch size    Train acc (%)   Val acc (%)   Test acc (%)   Time (s/epoch)
64×64×3       89.81           84.51         83.75          1
128×128×3     90.74           84.24         84.62          2
256×256×3     96.88           92.50         89.63          7

As can be seen from Table 3.3, the classification results get better as the patch size increases. In order to understand the results better, the feature maps after each convolutional layer have been plotted for each patch size. From Figures 3.4, 3.5 and 3.6 below, we can see that the first two convolutional layers extract shallow features of the crack; when the convolutional layers get deeper, higher-dimensional features show up after the convolution. This phenomenon explains intuitively why a big patch size gives better prediction results: a bigger patch size favours the deep convolutional layers in extracting more complex information about the crack features by providing a wider pixel range, whereas a small patch size has no large-scale information to give.
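For illustration, a full camera image can be tiled into such patches with a few lines of NumPy. This sliding-window sketch assumes non-overlapping patches and is not the exact implementation used in this study:

```python
import numpy as np

def extract_patches(image, patch=256):
    """Tile a full image into non-overlapping square patches."""
    h, w, _ = image.shape
    return [image[r:r + patch, c:c + patch]
            for r in range(0, h - patch + 1, patch)
            for c in range(0, w - patch + 1, patch)]

# Camera image dimensions from Section 3.1.1
full_image = np.zeros((1836, 3264, 3), dtype=np.uint8)
print(len(extract_patches(full_image)))  # 7 * 12 = 84 patches of 256x256
```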
Figure 3.4: The feature maps of each convolutional layer (input image and the visualized output of Conv layers 1-4), patch size 64×64

Figure 3.5: The feature maps of each convolutional layer (input image and the visualized output of Conv layers 1-4), patch size 128×128

Figure 3.6: The feature maps of each convolutional layer (input image and the visualized output of Conv layers 1-4), patch size 256×256

3.2.2 Parametric study

Database scale is always one of the most influential factors determining the final prediction results, and many studies have investigated how big an influence the database size can have. This parametric study discusses how the data constitution ratio and the total number of training images change the network performance. The scale study compares the network performance under various total numbers of training images. Except for the number of training images, all remaining parameters are the same as listed in Table 3.2.

Table 3.4: Comparison between different numbers of training images

Image number   Train acc (%)   Val acc (%)   Test acc (%)   Time (s/epoch)
8000           91.50           89.00         89.05          2
10000          94.73           89.45         90.40          2
20000          94.03           92.11         92.15          4
30000          93.87           92.26         92.67          6
40000          93.86           93.28         93.52          8

The testing results suggest that a large database size has a positive effect on the training process. However, the most significant improvement comes from increasing the database from 10,000 to 20,000 images. Enlarging the number of training images further still improves the accuracy, but the training efficiency drops as the rate of improvement slows. Considering the overall training efficiency, the total number of sub-images used for training is set to 30K in the following experiments.

3.2.3 The architecture modification

3.2.3.1 The filter size

The original network has four convolutional layers with the same filter size. In the study by Cha et al. [25], the CNN architecture has several convolutional layers with different filter sizes, which performs exceptionally well with sub-image dimension 256×256×3. For comparison purposes, a second network was modified with inspiration from the CNN built by Cha et al. [25]. As Figure 3.7 shows, the modified network still has four convolutional layers; however, the universal 3×3 convolutional filter size has been replaced by 20×20, 15×15, 10×10 and 1×1, respectively. Moreover, the network moves the batch normalization layer ahead and uses the activation layer only once. It is not hard to see that the network uses relatively wide strides in the first three convolutional operations to reduce the dimension of the feature maps.

Figure 3.7: The modified convolutional neural network

Table 3.5: Comparison between different networks

Network    Train acc (%)   Val acc (%)   Test acc (%)   Time (s/epoch)
Original   93.87           92.26         92.67          6
Modified   97.09           88.83         89.43          5

As Table 3.5 shows, the modified CNN has lower validation and test accuracy than the original one, despite a higher training accuracy.
This may be attributed to the wide stride sizes, which potentially have the same effect as having fewer convolutional layers.

3.2.3.2 Residue module

The residue module was first proposed by He et al. in 2015 [26]. The residue module is designed for training deep neural networks. The design motivation is predicated on the hypothesis that it is easier for the network to find the optimal solution starting from the previously optimized layers than through a stack of non-linear layers [26]. By adding shortcuts to the deep layers, the identity mapping from the previous layer is utilized to help the network find the optimal solution efficiently. In this section, the strategy is to add residue blocks to deepen the previous CNN architecture. The number of residue modules is the single variable in the following experiment.

Figure 3.8: The convolutional neural network with residue module

Table 3.6: Comparison between different numbers of residue modules

Num. residue modules   Train acc (%)   Val acc (%)   Test acc (%)   Time (s/epoch)
0                      93.87           92.26         92.67          6
1                      96.04           93.17         92.93          7
2                      98.49           96.35         95.87          8
3                      98.31           94.13         95.93          9

As can be seen from Table 3.6 above, the network performs best with three residue blocks. However, no significant improvement is observed when increasing from two to three residue blocks, which suggests a limit to gaining accuracy only by adding more residue blocks. The network architecture used in the rest of the experiments is the original network with three residue blocks.

3.2.4 The data imbalance

Data imbalance is a problem of data constitution arising when the data set has a largely unbalanced ratio between the data categories. Since the uncracked background is the dominating category of the camera-taken photos in this project, the data augmentation technique was applied to control the cracked-to-uncracked ratio. Switching between different training ratios inevitably alters the classification results. It is worth mentioning that when changing the cracked-to-uncracked ratio, the total image number remains the same. Moreover, the testing data were changed to five excluded full-scale images instead of the one tenth of the images from the training data set. The motivation for changing the testing data is to differentiate its composition from the training data, in order to fully address the effect caused by the cracked-to-uncracked ratio. The model with the highest validation accuracy during training is kept for the final test.

Table 3.7: Comparison between different cracked-to-uncracked ratios

Cracked-uncracked ratio   Accuracy (%)   Recall (%)   Precision (%)   F1 Score
1:1                       95.27          58.27        62.18           60.16
1:3                       95.51          33.86        78.18           47.25
1:5                       96.30          48.03        82.43           60.69
1:7                       96.59          67.72        74.78           71.07

Table 3.7 suggests that the best cracked-to-uncracked ratio is 1:7. Although this result might seem counter-intuitive at first, there is a non-negligible reason behind it. Since the total number of sub-images is fixed at 30K, once the number of cracked images decreases, the number of uncracked images inevitably increases. The uncracked images in the training database total around 22K, which means that increasing the cracked portion potentially cuts off some important uncracked feature input and decreases the accuracy in return. Another possible explanation is that when the uncracked portion increases, the training data constitution approaches that of the testing data. Therefore, the cracked-to-uncracked ratio for constituting the training data is 1:7 for the rest of the experiments.
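A cost-sensitive alternative to resampling, in the spirit of the cost-sensitive loss function referred to in Table 3.15, is to weight the rare cracked class more heavily in the loss. The following Keras sketch is illustrative only; the model, the random data and the label encoding (0 = uncracked, 1 = cracked) are placeholders, not the configuration of this study:

```python
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(128, 128, 3)),
    tf.keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

x = np.random.rand(80, 128, 128, 3).astype("float32")
y = np.random.randint(0, 2, 80)

# Weight the cracked class by 7 to mirror the 1:7 imbalance
model.fit(x, y, epochs=1, class_weight={0: 1.0, 1: 7.0})
```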
3.3 Transfer Learning

Transfer learning is a technique that utilizes a model trained on a large data set to classify new data sets. The weights contained in the pre-trained model are not updated, since freezing the pre-trained layers is the first step of the training process. One variation of transfer learning, named fine-tuning, chooses a few layers not to be frozen but to be trained on the provided data. The transfer learning method is prevalent in the image classification field due to its convenience and its capability of high accuracy.

3.3.1 The backbone of the feature extractor

In this section, four pre-trained models serve as the feature extractor: the fully connected layers of the original models are truncated and replaced with the classifiers defined in the next section. Feature extraction here means the convolutional process carried out by the multiple convolutional layers. The pre-trained models are VGG16, Inception V3, MobileNet and DenseNet from Keras, all of which include no classifier, since the classifier will be customized in the next section.

3.3.1.1 VGG16

VGG is the abbreviation of Visual Geometry Group, which designed this network in 2014 to find out how the depth of the network affects the final network performance [27]. The noticeable difference from AlexNet is that VGG16 replaces the big convolutional filter sizes of AlexNet with consecutive, unified 3×3 filters, which reduces the number of training parameters per layer to compensate for the increased depth. This network design concept is also known as factorizing convolutions. Table 3.8 below lists the VGG16 network architecture.

Table 3.8: The network architecture of VGG16

Layers           Output size    VGG16
Input            128×128×3      -
Convolution      128×128×64     3×3×64 conv
Convolution      128×128×64     3×3×64 conv, stride 2
Max pooling      64×64×64       2×2×32 max pooling
Convolution      64×64×64       3×3×64 conv
Convolution      64×64×128      3×3×128 conv
Max pooling      32×32×128      2×2×128 max pooling
Convolution      32×32×256      3×3×256 conv
Convolution      32×32×256      3×3×256 conv
Convolution      32×32×256      3×3×256 conv
Max pooling      16×16×256      2×2×256 max pooling
Convolution      16×16×512      3×3×512 conv
Convolution      16×16×512      3×3×512 conv
Convolution      16×16×512      3×3×512 conv
Max pooling      8×8×512        2×2×512 max pooling
Convolution      8×8×512        3×3×512 conv
Convolution      8×8×512        3×3×512 conv
Convolution      8×8×512        3×3×512 conv
Max pooling      4×4×512        2×2×512 max pooling
Classification   4096           4×4×512×4096 fully connected layer
                 1000           4096×1000 fully connected, softmax

3.3.1.2 Inception V3

The Inception V3 model has five different convolution modules, all of which contain multi-scale convolution processes. The difference between Inception version 3 and the previous version is that the added auxiliary classification part is used just for regularization purposes.
3.3.1.2 Inception V3

The Inception V3 model has five different inception modules, all of which contain multi-scale convolution processing. The difference between Inception version three and the previous version is that the auxiliary classification part is added purely for regularization purposes.

Table 3.9: The network architecture of Inception V3

Layers | Output size | Inception V3
Input | 128×128×3 |
Convolution | 63×63×32 | 3×3×3×32 conv
Convolution | 61×61×32 | 3×3×32×32 conv
Convolution | 61×61×64 | 3×3×32×64 conv
Max Pooling | 30×30×64 | 2×2 max pooling
Convolution | 30×30×80 | 1×1×64×80 conv
Convolution | 28×28×192 | 3×3×80×192 conv
Max Pooling | 13×13×192 | 2×2 max pooling
Inception A ×3 | 13×13×288 | module processing
Inception B ×1 | 6×6×768 | module processing
Inception C ×4 | 6×6×768 | module processing
Inception D ×1 | 2×2×1280 | module processing
Inception E ×2 | 2×2×2048 | module processing
Average Pooling | 2048 | 2×2×2048 global average pooling
Classification | 1000 | 2048×1000 fully connected layer, softmax

Figure 3.9: The different modules (A-E) in the Inception V3 network

3.3.1.3 MobileNet

Proposed by Andrew G. Howard et al. in 2017 [28], the MobileNet model applies depth-wise separable convolution in its network design and has 28 layers in total. The so-called "depth-wise separable convolution" consists of two individual convolutional operations with different purposes, achieved by depth-wise and point-wise convolutional layers: the former filters the input channels first, and the latter then fuses the information across channels. This elegant convolution design significantly reduces the number of parameters compared to a traditional convolutional network (see the sketch after Table 3.10).

Table 3.10: The network architecture of MobileNet

Layers | Output size | MobileNet
Input | 128×128×3 |
Convolution | 64×64×32 | 3×3×3×32 conv, stride 2
Convolution dw | 64×64×32 | 3×3×32 conv, stride 1
Convolution pw | 64×64×64 | 1×1×32×64 conv, stride 1
Convolution dw | 32×32×64 | 3×3×64 conv, stride 2
Convolution pw | 32×32×128 | 1×1×64×128 conv, stride 1
Convolution dw | 32×32×128 | 3×3×128 conv, stride 1
Convolution pw | 32×32×128 | 1×1×128×128 conv, stride 1
Convolution dw | 16×16×128 | 3×3×128 conv, stride 2
Convolution pw | 16×16×256 | 1×1×128×256 conv, stride 1
Convolution dw | 16×16×256 | 3×3×256 conv, stride 1
Convolution pw | 16×16×256 | 1×1×256×256 conv, stride 1
Convolution dw | 8×8×256 | 3×3×256 conv, stride 2
Convolution pw | 8×8×512 | 1×1×256×512 conv, stride 1
5× Convolution dw | 8×8×512 | 3×3×512 conv, stride 1
5× Convolution pw | 8×8×512 | 1×1×512×512 conv, stride 1
Convolution dw | 4×4×512 | 3×3×512 conv, stride 2
Convolution pw | 4×4×1024 | 1×1×512×1024 conv, stride 1
Convolution dw | 4×4×1024 | 3×3×1024 conv, stride 2
Convolution pw | 4×4×1024 | 1×1×1024×1024 conv, stride 1
Average Pooling | 1024 | 4×4×1024 global average pooling
Classification | 1000 | 1024×1000 fully connected layer, softmax
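The parameter saving of the depth-wise separable design can be illustrated with a small example; the channel numbers below are taken from one stage of Table 3.10.

from tensorflow.keras import layers

# Standard convolution on a 32-channel input producing 64 channels:
# 3*3*32*64 = 18,432 weights plus biases.
standard = layers.Conv2D(64, 3, padding="same")

# Depth-wise separable convolution for the same mapping: one 3x3 filter per
# input channel (3*3*32 = 288 weights) followed by 1x1 point-wise fusion
# (1*1*32*64 = 2,048 weights), i.e. roughly 8x fewer parameters.
separable = layers.SeparableConv2D(64, 3, padding="same")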
3.3.1.4 DenseNet169

Table 3.11: The network architecture of DenseNet169

Layers | Output size | DenseNet169
Input | 128×128×3 |
Convolution | 64×64×64 | 7×7 conv, stride 2
Pooling | 32×32×64 | 3×3 max pool, stride 2
DenseNet Block (1) | 32×32×256 | [1×1 conv, 3×3 conv] × 6
Transition Layer (1) | 32×32×128 | 1×1 conv
 | 16×16×128 | 2×2 average pool, stride 2
DenseNet Block (2) | 16×16×512 | [1×1 conv, 3×3 conv] × 12
Transition Layer (2) | 16×16×256 | 1×1 conv
 | 8×8×256 | 2×2 average pool, stride 2
DenseNet Block (3) | 8×8×1280 | [1×1 conv, 3×3 conv] × 32
Transition Layer (3) | 8×8×640 | 1×1 conv
 | 4×4×640 | 2×2 average pool, stride 2
DenseNet Block (4) | 4×4×1664 | [1×1 conv, 3×3 conv] × 32
Classification | 1664 | 4×4 global average pool
 | 1000 | fully connected, softmax

Table 3.11 shows the full scale of DenseNet169 based on a 128×128×3 input image. Since DenseNet169 serves as the backbone for feature extraction, the fully connected layers of the original network are not included. Which classifier achieves the best network performance is the subject of the next section.

3.3.2 The classifier

3.3.2.1 The fully connected layer

After the last layer of the pre-trained model, the multi-dimensional data is first flattened and then connected with trainable weights and biases in the dense layers. Most of the weight updates take place in this part. In this experiment, the first dense layer consists of 256 neurons and is followed by a ReLU activation and a drop-out layer. The second dense layer contains two neurons representing the probabilities of the target output being 1 or 0 after the softmax activation.

Figure 3.10: The fully connected classifier

Table 3.12: Different feature extractors with the fully connected classifier

Pre-trained model | Accuracy (%) | Recall (%) | Precision (%) | F1 score
VGG16 | 95.24 | 31.50 | 72.73 | 43.96
Inception V3 | 95.43 | 50.60 | 68.74 | 57.27
DenseNet169 | 96.74 | 62.20 | 78.22 | 69.30
MobileNet | 97.11 | 66.14 | 81.55 | 73.04

3.3.2.2 Random forest

A random forest consists of many random decision trees. Each tree is built by first picking data at random and then splitting the tree nodes based on randomly selected features of the data. Figure 3.11 showcases an example of the growth of one random tree.

Figure 3.11: The random forest classifier

Table 3.13: Different feature extractors with the random forest classifier

Pre-trained model | Accuracy (%) | Recall (%) | Precision (%) | F1 score
VGG16 | 95.88 | 38.58 | 83.05 | 52.69
Inception V3 | 95.23 | 28.35 | 76.60 | 41.38
DenseNet169 | 96.16 | 40.16 | 89.47 | 55.43
MobileNet | 96.77 | 53.54 | 87.18 | 66.34

As can be seen from Table 3.13, the random forest classifier has a lower recall than the fully connected dense layer but a higher precision. In this project, recall should be prioritized, since the classification task should err on the conservative side. Hence, the fully connected layer is more suitable for the classification task. However, if precision is valued more, the random forest classifier can also be a good choice, and it takes less time to train.

3.4 Results

As the testing results in the previous section have revealed, MobileNet as backbone offers the highest F1 score for both classifiers, even higher than the shallow CNN proposed in the previous section. Moreover, as Table 3.14 shows, the number of parameters for fully training MobileNet is not far above that of the shallow CNN. Therefore, it is worthwhile to train MobileNet from scratch with more testing data.
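As a concrete illustration of the transfer-learning set-up compared above, the following is a minimal Keras sketch of the fully connected classifier from Section 3.3.2.1 on a frozen MobileNet backbone; the drop-out rate and the optimizer are assumptions, as the text does not fix them.

from tensorflow.keras import layers, models
from tensorflow.keras.applications import MobileNet

backbone = MobileNet(include_top=False, weights="imagenet",
                     input_shape=(128, 128, 3))
backbone.trainable = False  # only the classifier head is trained

model = models.Sequential([
    backbone,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),   # first dense layer, 256 neurons
    layers.Dropout(0.5),                    # drop-out rate is an assumption
    layers.Dense(2, activation="softmax"),  # cracked / un-cracked probabilities
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])

For the random forest variant of Section 3.3.2.2, the output of the frozen backbone would instead be flattened and fed to, for example, scikit-learn's RandomForestClassifier.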
Although the accuracy is very high, the recall ratio is still very low in all the training cases (the highest recall was 67%). Since recall needs extra attention in this study, the focal loss function with parameter α equal to ten and γ equal to one is applied to ease the low-recall problem. The number of test images is increased from 6 to 15 to further guard against over-fitting. The total number of training images remains fixed at 30K, and the up-sampling of cracked data is still applied; however, the ratio is no longer 1:7, and the training set is instead filled with the augmented cracked images and all the uncracked images. The cost-sensitive loss function is a revised version of the binary cross-entropy function, taking the form:

WeightedCrossEntropy = -w_0 \, y \log(p) - w_1 (1 - y) \log(1 - p)   (3.1)

where w_0 and w_1 are weights used to prioritize a specified category.

Table 3.14: Parameter comparison

Model | Layers | Parameters (million)
Shallow CNN | 35 | 3.9
VGG16 transferred | 26 | 16.8
Inception V3 transferred | 318 | 23.9
DenseNet169 transferred | 602 | 19.4
MobileNet transferred | 93 | 7.4
MobileNet fully trained | 94 | 4.2

Table 3.15: Different networks under the cost-sensitive loss function

Model | Parameters (million) | Accuracy (%) | Recall (%) | Precision (%) | F1 score
MobileNet un-weighted | 4.2 | 96.98 | 93.28 | 98.15 | 95.65
MobileNet α = 1 | 4.2 | 97.29 | 94.70 | 97.54 | 96.10
MobileNet α = 0.75 | 2.6 | 97.15 | 94.34 | 97.50 | 95.90
MobileNet α = 0.5 | 1.3 | 96.91 | 94.34 | 96.79 | 95.55
MobileNet α = 0.25 | 0.5 | 95.04 | 94.34 | 91.91 | 93.12
Shallow CNN | 3.9 | 91.03 | 83.02 | 90.91 | 86.79

From Table 3.15 it can be seen that as the α value (here the MobileNet width multiplier, which scales the number of channels, cf. Section 6.1) decreases, the reduction in recall is modest, but the precision keeps dropping. When α decreases to 0.25, the accuracy plunges to 95%. As mentioned previously, a high recall ratio means safer predictions in this project. Although the parameter count increases at α = 1, it is still the final choice.

Figure 3.12: The classification results ((a) un-cracked image, (b) cracked image)

Figure 3.12 shows an example of the classification results of the fully trained MobileNet. Based on the classification performance, it is reasonable to divide the testing images into three categories: images with an ordinary concrete surface, images containing noise, and images with crack-like features. The patches identified as cracked sub-images are marked in red.

Figure 3.13: The classification results ((a, b) ordinary concrete surface, (c) crack with noise, (d) crack with crack-like features)

As Figure 3.13 shows, the full-scale picture of an ordinary concrete surface is classified very accurately, whereas some cracked sub-images are still missed in sub-pictures (c) and (d). It is reasonable to conclude that the network gets confused once an image contains too many crack-like features, and the recall ratio decreases in consequence.

3.5 Conclusion

In this section, high accuracy has been reached in the cracked-image classification task. A series of tests was run to choose proper training parameters and to compare different convolutional neural network architectures. The results show that the MobileNet model has very high training efficiency and exceptional performance on image classification. Using the focal loss function and the data up-sampling technique helped to reach a higher recall ratio. However, the results also show that, if further optimization is sought, the ability to distinguish crack-like features from authentic cracks should be improved.
An image that contains too many crack-like features will impair the recall ratio. Despite this limitation, the results still show that, by providing MobileNet with appropriate training hyper-parameters, the crack classification task can be accomplished successfully with high accuracy.

4 Image Segmentation

Image segmentation refers to the process of recognizing and localizing different elements in a digital image. More specifically, a label is assigned to each pixel in a digital image in such a way that pixels having the same label represent the same object. In Chapter 3, a CNN classifier was developed which can recognize the presence of a crack in a sub-image. But the sub-image resolution is still not sufficient for the subsequent structural performance prediction procedure. Though cracks have been located within the range of a sub-image, information such as the shape and the width is still missing, and these crack features can be essential for structural analysis.

The goal of this chapter is to develop an image segmentation technique that can extract cracks from a sub-image classified as positive by the preceding CNN classification step. With the image segmentation processing, geometric information about the cracks can be extracted, and the resolution of the detected cracks is increased from sub-image patches to pixel level.

There are a few reasons for performing image segmentation on sub-images rather than full images. Full images usually cover a large area of the concrete surface, and cracks are usually very small objects in a full-sized photo. For CNN-based image segmentation, the identification of small objects is much more difficult to achieve, and areas unrelated to a crack become unnecessary interference, which not only consumes extra computational time but is also harmful to the segmentation results. With the help of the CNN classification, only the sub-images related to cracks are considered and all other non-relevant areas are excluded, which significantly increases the efficiency of the image segmentation process.

In this chapter, CNN models based on the well-known U-net architecture are established for the segmentation task and trained with manually annotated crack sub-images. To address the issue of small-object detection, different loss functions are used in the training. The models are tested on different kinds of crack sub-images and evaluated quantitatively and qualitatively. Finally, the best model is chosen considering both segmentation performance and computational efficiency.

4.1 U-net architecture

Convolutional neural networks are mostly used for data classification purposes, and image segmentation is basically classification of pixels. This requires a network to understand the entire context of the input image. The U-net architecture is one of the most famous CNN architectures; it was first established for biomedical image processing and won the Cell Tracking Challenge at ISBI in 2015 [29]. The architecture of U-net is shown in Figure 4.1.

Figure 4.1: U-net architecture (down-sampling path DownConv 1-4 with 64 to 512 feature maps, a 1024-feature-map bottleneck, and up-sampling path UpConv 6-9 with skip connections; 3×3 conv + ReLU, 2×2 max pooling, 2×2 up-conv, concatenation, dropout and a final 1×1 conv with sigmoid output)

The U-net architecture has a symmetrical structure, similar to an encoder-decoder neural network.
It consists of a down-sampling path, an up-sampling path and skip connections between the layers of the two paths. The interface between the two paths is the 'bottleneck'. Four convolutional units are employed in each of the down-sampling and up-sampling paths. Table 4.1 shows the detailed layer components of each unit. The stride value for all convolutional kernels is 1 pixel in both directions, which means the kernel slides 1 pixel at a time over the input tensor. The padding is set to 'same', which means the output feature maps of the convolutional layers have the same size as the input tensor. The size of all MaxPooling kernels is 2×2 with a stride of 2, which means the output of the MaxPooling layers has half the size of the input tensor.

Table 4.1: Layer components of each unit in U-net

Unit | Layers | Kernel size | Number of feature maps
DownConv_1 | Conv, ReLU | 3×3 | 64
 | Conv, ReLU | 3×3 | 64
 | MaxPooling | 2×2 | -
DownConv_2 | Conv, ReLU | 3×3 | 128
 | Conv, ReLU | 3×3 | 128
 | MaxPooling | 2×2 | -
DownConv_3 | Conv, ReLU | 3×3 | 256
 | Conv, ReLU | 3×3 | 256
 | MaxPooling | 2×2 | -
DownConv_4 | Conv, ReLU | 3×3 | 512
 | Conv, ReLU | 3×3 | 512
 | DropOut | - | -
 | MaxPooling | 2×2 | -
BottleNeck | Conv, ReLU | 3×3 | 1024
 | Conv, ReLU | 3×3 | 1024
 | DropOut | - | -
UpConv_6 | UpConv, ReLU | 2×2 | 512
 | Concatenate | - | [512+512]
 | Conv, ReLU | 3×3 | 512
 | Conv, ReLU | 3×3 | 512
UpConv_7 | UpConv, ReLU | 2×2 | 256
 | Concatenate | - | [256+256]
 | Conv, ReLU | 3×3 | 256
 | Conv, ReLU | 3×3 | 256
UpConv_8 | UpConv, ReLU | 2×2 | 128
 | Concatenate | - | [128+128]
 | Conv, ReLU | 3×3 | 128
 | Conv, ReLU | 3×3 | 128
UpConv_9 | UpConv, ReLU | 2×2 | 64
 | Concatenate | - | [64+64]
 | Conv, ReLU | 3×3 | 64
 | Conv, ReLU | 3×3 | 64
Output | Conv, ReLU | 3×3 | 2
 | Conv, Sigmoid | 1×1 | 1

In the up-sampling units, the input tensor is first doubled in size in the UpConv layers and then a convolutional transformation with 2×2 kernels is performed. Figure 4.2 shows an example of the UpConv layer. Extra columns and rows are added at the right-bottom corner to fulfill the same-padding condition, and the pixels in these areas have values of 0. In this way, the outputs of the UpConv layers are doubled in size. After the UpConv layers, the doubled-size feature maps are concatenated with the feature maps from the corresponding unit in the down-sampling path, and the merged feature maps are processed with two more convolution layers.

Figure 4.2: An example of an UpConv layer with 2×2 kernels, stride 1 and same padding

The original U-net architecture was established for medical image segmentation, which is a much more complicated task than the identification of concrete cracks. It is not necessary to employ such a huge CNN model for crack detection purposes, since it takes more computational power. In this study, smaller models are built in two ways: firstly, the number of feature maps of each Conv layer is halved; secondly, one unit is removed from both the down-sampling and the up-sampling path. In these ways, four architectures are established, namely Unet-4×64 (the original architecture), Unet-4×32, Unet-3×64 and Unet-3×32, where the first number denotes the number of units in each path and the second number denotes the number of feature maps in the first Conv layer of the down-sampling path. These models will be trained and compared, and the model with the best performance will be chosen (a minimal sketch of the two repeating units is given below).
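The following is a minimal Keras sketch of the two repeating units of Table 4.1; it follows the UpConv definition above (doubling by up-sampling, then a 2×2 same-padded convolution), while details such as the drop-out placement are simplified.

from tensorflow.keras import layers

def down_unit(x, n_filters):
    """DownConv unit: two 3x3 conv + ReLU layers, then 2x2 max pooling."""
    c = layers.Conv2D(n_filters, 3, padding="same", activation="relu")(x)
    c = layers.Conv2D(n_filters, 3, padding="same", activation="relu")(c)
    p = layers.MaxPooling2D(2)(c)
    return c, p  # c is kept for the skip connection to the up-sampling path

def up_unit(x, skip, n_filters):
    """UpConv unit: double the size, 2x2 convolution, concatenate the skip
    connection, then two 3x3 conv + ReLU layers."""
    u = layers.UpSampling2D(2)(x)
    u = layers.Conv2D(n_filters, 2, padding="same", activation="relu")(u)
    u = layers.Concatenate()([u, skip])
    u = layers.Conv2D(n_filters, 3, padding="same", activation="relu")(u)
    u = layers.Conv2D(n_filters, 3, padding="same", activation="relu")(u)
    return u

Chaining four (or three) of each unit with halved or full feature-map counts reproduces the four architectures Unet-4×64, Unet-4×32, Unet-3×64 and Unet-3×32.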
4.2 Data-set

Unlike CNN classification, the output of an image segmentation CNN is not a single number indicating the sub-image category, but a single-channel image in which the pixel values represent the probability of being a crack. The ground truth of the training data should accordingly be binary images, where a pixel is either background (0) or crack (1).

230 sub-images sized 128×128 introduced in Chapter 3, together with 220 images selected from the METU data-set [30], are used to establish the data-set. The METU crack image data-set is a public image classification data-set. It provides images sized 227×227 with binary labels indicating whether cracks are present in the image, but pixel-level annotation is not available. It is used here to enrich the training data-set: among the METU images, 170 are resized to 128×128, and the rest keep the size of 227×227 to generate more images by random cropping.

The original crack images were annotated manually in 4×4-pixel patches, considering the balance between efficiency and accuracy. For each crack sub-image, a binary image indicating the specific location and shape of the cracks is generated. Figure 4.3 shows a few crack sub-images and the corresponding annotations.

Figure 4.3: Visualization of the annotated crack image data-set

The number of crack sub-images is still far too small to train a neural network having millions of parameters, so data augmentation is applied to expand the data-set. For the sub-images sized 227×227, 4 images sized 128×128 are generated by random cropping. For all images sized 128×128, including those cropped images, 7 more images are further generated by flipping and rotating. Figures 4.4 and 4.5 show examples of the data augmentation for sub-images of different sizes.

Figure 4.4: An example of flipping and rotation for images sized 128×128

Figure 4.5: An example of random cropping for images sized 227×227

As a result, the final data-set has 4800 sub-images with annotations. The data-set is divided into training, validation and test sets according to a ratio of 24:5:1, i.e. 3840 for training, 800 for validation and 160 for testing.

4.3 Unbalanced segmentation and loss functions

For the specified crack segmentation task, a major obstacle for training is the fact that cracks are usually very small objects in the sub-images: in the 4800 sub-images, positive pixels only take up approximately 10%. This unbalance between the two classes can cause problems for CNN models in recognizing the target area in the input images. Specifically, false negative predictions create far less loss than false positive predictions during training, so the model learns more about the negative area than the positive area, resulting in poor performance. One reason for this unbalanced segmentation problem is that the commonly used loss function, Binary Cross Entropy (BCE), treats false positive and false negative predictions equally. Research in this area has therefore tried to change the definition of the loss function used for training. In this study, to find a suitable model for the crack segmentation task, models trained with generalized loss functions are compared, and the definitions of the generalized loss functions used in this study are introduced in this section. Without loss of generality, the formulations of the selected loss functions are expressed for the multi-class classification case. For class c, the float variable p_{nc} ∈ [0, 1] represents the predicted probability, and the binary variable g_{nc} ∈ {0, 1} is the ground truth of pixel n being that class. N is the total number of pixels in a sub-image, and ε is a small real number to prevent division by zero.
1. Binary Cross Entropy (BCE): As introduced in subsection 2.2.2.1, Binary Cross Entropy is the most common loss function. The formula of BCE can be written as:

BCE = -\sum_{n=1}^{N} \left[ g_{nc} \log(p_{nc}) + (1 - g_{nc}) \log(1 - p_{nc}) \right]   (4.1)

2. Dice Loss (DCL): Dice Loss is based on the Dice Score Coefficient (DSC), which describes the extent of overlap between the predicted result and the ground truth [31]. The formulation of the Dice Score Coefficient is:

DSC = \sum_{c} \frac{\sum_{n=1}^{N} p_{nc} g_{nc} + \epsilon}{\sum_{n=1}^{N} p_{nc} + \sum_{n=1}^{N} g_{nc} + \epsilon}   (4.2)

Dice Loss (DCL) is a target function to be minimized in the training, first proposed by Milletari et al. [32]:

DCL = 1 - DSC   (4.3)

3. Tversky Loss (TL): The Tversky Index (TI) also describes the overlap of prediction and ground truth, but puts different weights on false negative and false positive predictions through the coefficients α and β [33]. When α = β = 0.5, the Tversky Index is equivalent to the Dice Score Coefficient. In this study the values are α = 0.7 and β = 0.3, since for small-target segmentation tasks false negative predictions have more influence on the final prediction. The expressions for the Tversky Index (TI) and the Tversky Loss (TL) are:

TI = \sum_{c} \frac{\sum_{n=1}^{N} p_{nc} g_{nc} + \epsilon}{\sum_{n=1}^{N} p_{nc} g_{nc} + \alpha \sum_{n=1}^{N} (1 - p_{nc}) g_{nc} + \beta \sum_{n=1}^{N} p_{nc} (1 - g_{nc}) + \epsilon}   (4.4)

TL = 1 - TI   (4.5)

4. Focal Tversky Loss (FTL): Abraham et al. [34] proposed the Focal Tversky Loss by adding an exponent γ to the Tversky Loss. The index γ is smaller than 1, so that when TI becomes large, which means a better performance, FTL decreases more significantly. In this study γ is set to 0.75. The expression for FTL is:

FTL = (1 - TI)^{\gamma}   (4.6)

4.4 Evaluation Criteria

In this study, the evaluation of the models is based on two indicators, Sensitivity (SEN) and Specificity (SPEC). These two indicators are the correct-classification ratios for each class and represent the ability to recognize cracks and background respectively. Sensitivity and Specificity can be expressed through the number of true positive (TP), true negative (TN), false positive (FP) and false negative (FN) predictions, as shown in equation (4.7):

SEN = \frac{TP}{TP + FN}, \qquad SPEC = \frac{TN}{TN + FP}   (4.7)

Figure 4.6 shows an example of prediction results and the corresponding sensitivity and specificity values.

Figure 4.6: An example of true positive, true negative, false positive and false negative predictions

In field application, the model faces complicated input images. To evaluate the performance of the models under different situations, the test data set is divided manually into five groups: wide, thin, multi, blur and noise (shown in Figure 4.7), and the model performance is evaluated on each group in terms of mean SEN and SPEC.

Figure 4.7: Different groups of test images for evaluation ((a) wide, (b) thin, (c) multi, (d) blur, (e) noise)

Lastly, an important aspect of concern is the computational cost of each model, which is related only to the model architecture and not influenced by the training method or loss function. The four architectures mentioned in Section 4.1 are tested in the same computational environment and compared in terms of the mean processing time for one sub-image.

4.5 Training and Results

The models are built and trained on the Google Colab platform, the same machine learning environment used in Chapter 3; the specifications of the hardware and software environment are listed in Table 3.1. All models are trained for 100 epochs on the training data-set, with the Adam optimizer and a learning rate of 10^{-4}.
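The generalized loss functions above translate directly into TensorFlow; the following is a minimal single-class sketch, where the sums run over all pixels of a batch and the α, β and γ values are those stated in Section 4.3 (the ε value is an assumption).

import tensorflow as tf

EPS = 1e-6  # the small epsilon of equations (4.2) and (4.4); value assumed

def dice_loss(y_true, y_pred):
    """DCL = 1 - DSC, equations (4.2)-(4.3)."""
    inter = tf.reduce_sum(y_true * y_pred)
    dsc = (inter + EPS) / (tf.reduce_sum(y_true) + tf.reduce_sum(y_pred) + EPS)
    return 1.0 - dsc

def tversky_loss(y_true, y_pred, alpha=0.7, beta=0.3):
    """TL = 1 - TI, equations (4.4)-(4.5); alpha weights the false negatives."""
    tp = tf.reduce_sum(y_true * y_pred)
    fn = tf.reduce_sum(y_true * (1.0 - y_pred))
    fp = tf.reduce_sum((1.0 - y_true) * y_pred)
    ti = (tp + EPS) / (tp + alpha * fn + beta * fp + EPS)
    return 1.0 - ti

def focal_tversky_loss(y_true, y_pred, gamma=0.75):
    """FTL = (1 - TI)^gamma, equation (4.6)."""
    return tf.pow(tversky_loss(y_true, y_pred), gamma)

Each of these can be passed straight to model.compile(loss=...) together with the Adam optimizer and the learning rate of 10^{-4} stated above.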
After each training epoch, the loss, SEN and SPEC values for the training and validation sets are recorded to plot training curves. Figure 4.8 shows the training curves of the sensitivity value for the same model trained with different loss functions. From the training curves it can be concluded that DCL, TL and FTL can all improve the training progress: models trained with these loss functions converge at a higher sensitivity, resulting in a better performance than those trained with BCE. However, the effects of these loss functions differ between architectures.

Figure 4.8: Training curves of models trained with different loss functions ((a) model 4×64, (b) model 4×32, (c) model 3×64, (d) model 3×32)

A prediction on a chosen sub-image from the validation set is made after each epoch, to learn and compare how the different models progress. Figure 4.9 shows the predictions for this validation sub-image after the first 3 epochs and after 10 and 100 epochs for each training run. The first prediction is a result of the random initialization of the model parameters.

Figure 4.9: Training progression of different models ((a) models trained with BCE, (b) models trained with the other loss functions)

Tables 4.2 and 4.3 show the performance of the different models in terms of sensitivity and specificity on the different test image groups introduced in Section 4.4. The Global column shows the indicator values for the whole test set. The optimal results for each column are marked with bold font in the original tables.

Table 4.2: Sensitivity (%) of different models

Loss function | Architecture | Thin | Wide | Multi | Noise | Blur | Global
BCE | Unet-4×64 | 73.79 | 90.04 | 74.80 | 72.47 | 76.18 | 82.64
BCE | Unet-4×32 | 71.46 | 89.70 | 71.33 | 74.77 | 73.80 | 80.32
BCE | Unet-3×64 | 76.73 | 90.00 | 76.07 | 71.99 | 76.23 | 82.72
BCE | Unet-3×32 | 77.58 | 91.30 | 77.62 | 70.05 | 74.33 | 82.44
FTL | Unet-4×64 | 84.62 | 94.89 | 84.48 | 79.11 | 87.24 | 87.62
FTL | Unet-4×32 | 83.47 | 94.62 | 83.88 | 77.07 | 84.24 | 86.49
FTL | Unet-3×64 | 84.16 | 93.46 | 83.73 | 79.23 | 85.14 | 86.66
FTL | Unet-3×32 | 81.18 | 91.31 | 81.44 | 71.74 | 77.24 | 82.82
TL | Unet-4×64 | 83.04 | 93.04 | 81.50 | 78.92 | 83.40 | 85.63
TL | Unet-4×32 | 79.78 | 93.65 | 81.88 | 74.98 | 84.55 | 84.68
TL | Unet-3×64 | 86.58 | 93.28 | 84.62 | 80.16 | 82.43 | 87.12
TL | Unet-3×32 | 82.03 | 93.95 | 83.19 | 76.78 | 81.24 | 85.36
DCL | Unet-4×64 | 85.99 | 94.49 | 83.64 | 81.16 | 86.24 | 87.85
DCL | Unet-4×32 | 88.35 | 94.18 | 85.07 | 78.71 | 87.07 | 88.36
DCL | Unet-3×64 | 74.59 | 89.31 | 76.49 | 72.35 | 75.38 | 79.56
DCL | Unet-3×32 | 77.02 | 91.49 | 77.27 | 73.96 | 76.53 | 81.39

Table 4.3: Specificity (%) of different models

Loss function | Architecture | Thin | Wide | Multi | Noise | Blur | Global
BCE | Unet-4×64 | 98.93 | 98.71 | 98.31 | 98.69 | 99.12 | 98.76
BCE | Unet-4×32 | 99.03 | 99.00 | 98.35 | 98.83 | 99.24 | 98.92
BCE | Unet-3×64 | 99.04 | 98.74 | 98.26 | 98.78 | 99.21 | 98.81
BCE | Unet-3×32 | 99.12 | 99.02 | 98.65 | 97.72 | 99.35 | 98.86
FTL | Unet-4×64 | 98.44 | 97.86 | 97.14 | 97.94 | 98.41 | 97.99
FTL | Unet-4×32 | 98.44 | 97.78 | 97.27 | 97.90 | 98.58 | 98.00
FTL | Unet-3×64 | 98.39 | 97.93 | 97.02 | 98.08 | 98.57 | 98.02
FTL | Unet-3×32 | 98.50 | 98.17 | 97.30 | 98.40 | 98.88 | 98.25
TL | Unet-4×64 | 98.55 | 98.33 | 97.48 | 98.44 | 98.85 | 98.34
TL | Unet-4×32 | 98.68 | 98.16 | 97.56 | 98.62 | 98.78 | 98.34
TL | Unet-3×64 | 98.22 | 97.89 | 97.00 | 98.36 | 98.79 | 98.02
TL | Unet-3×32 | 98.46 | 97.80 | 97.20 | 98.20 | 98.57 | 98.04
DCL | Unet-4×64 | 98.38 | 97.94 | 97.28 | 98.51 | 98.61 | 98.12
DCL | Unet-4×32 | 98.18 | 97.86 | 97.10 | 98.34 | 98.54 | 97.98
DCL | Unet-3×64 | 98.93 | 98.79 | 98.12 | 98.85 | 99.24 | 98.79
DCL | Unet-3×32 | 98.85 | 98.58 | 98.01 | 98.38 | 99.04 | 98.60

The performance of the different models is also compared visually. Figure 4.10 shows the prediction results of Unet-4×32 models trained with different loss functions, and Figure 4.11 shows the prediction results of the different model architectures trained with DCL.
Figure 4.10: Comparison of Unet-4×32 models trained with different loss functions

Figure 4.11: Comparison of different model architectures trained with DCL

Table 4.4 compares the computational cost of the different architectures in terms of mean processing time for a 128×128 sub-image. The best result is marked with bold font in the original table.

Table 4.4: Computational costs of the models

Model | Number of parameters | Processing time per sub-image (s)
Unet-4×64 | 31,032,837 | 0.5197
Unet-4×32 | 7,760,645 | 0.1808
Unet-3×64 | 7,698,437 | 0.4018
Unet-3×32 | 1,926,149 | 0.1475

4.6 Summary

Theoretically, a larger CNN architecture usually has more potential to accomplish a certain task than a smaller one, but it is not necessarily true that a larger model always reaches a more satisfying result. It can be concluded from Table 4.2 that, with the training setup and data-set specified in Section 4.2 and Section 4.5, the best model reaches a sensitivity of 88.36%. The performance of the models is also influenced by the crack image type: thin cracks, blurred images, cracks with complicated shapes and the presence of noise can all negatively influence the final prediction to different extents. On the other hand, a larger architecture costs more training time and computational power in application. Comparing Unet-4×64 and Unet-4×32 in Table 4.4, more than 65% of the processing time is saved by decreasing the number of feature maps.

From both the training curves (Figure 4.8) and the visualized training progression (Figure 4.9) it can be concluded that DCL, TL and FTL make the models focus more on the positive predictions and thus increase the sensitivity to crack pixels. While the recognition ratio of true positive pixels is improved, false positive results also increase and cause a decrease in specificity. Comparing Table 4.2 and Table 4.3, it is obvious that the benefit gained from these generalized loss functions is more considerable than the compromise in specificity, especially for the thin-crack cases.

Overall, the Unet-4×32 trained with DCL is proposed for the sub-image crack segmentation task: it not only has a stable performance across the different test groups and the highest sensitivity on the test data-set, but also saves considerable computational power.

5 Experiment

As introduced in Section 1.5, the proposed algorithm is a combination of the classification and segmentation CNNs. In this chapter, the performance of this combined algorithm is tested on photos of cracks (a minimal sketch of the combined procedure is given after Figure 5.1).

The algorithm is first tested on photos collected from real structures with a hand-held camera, which have a resolution of 1836×3264 pixels. These photos are also the data source of the sub-images used for training the two CNN models. The image classification and segmentation results are shown in Figure 5.1, where the sub-images classified as positive are marked with red squares and the pixel-level segmentation highlights the cracks in red. Since the accuracy of CNN models is always higher on training data, the test performed on these photos only serves to show how the combined algorithm works.

Figure 5.1: Full image test
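A minimal sketch of how the two networks are combined on a full image is given below; the function and variable names are hypothetical, and it assumes Keras models with the 128×128 input size of Chapter 3 and the single-channel mask output of Chapter 4 (class index 1 taken as "cracked").

import numpy as np

PATCH = 128  # sub-image size shared by the classification and segmentation CNNs

def detect_cracks(full_image, classifier, segmenter, threshold=0.5):
    """Classify 128x128 patches of a full image, then segment only the
    patches classified as positive; returns a full-size crack mask."""
    h, w = full_image.shape[:2]
    mask = np.zeros((h, w), dtype=np.float32)
    for i in range(0, h - PATCH + 1, PATCH):
        for j in range(0, w - PATCH + 1, PATCH):
            x = full_image[i:i + PATCH, j:j + PATCH][np.newaxis] / 255.0
            p_cracked = classifier.predict(x, verbose=0)[0, 1]
            if p_cracked > threshold:
                # Segmentation is skipped for negative patches, which saves
                # computation and avoids false positives on the background.
                mask[i:i + PATCH, j:j + PATCH] = \
                    segmenter.predict(x, verbose=0)[0, :, :, 0]
    return mask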
To assess the performance of the proposed algorithm and the feasibility of this method in field application, an experimental inspection is conducted with a drone. In the field, to get a clear photo of thin cracks, the drone needs to get very close to the structure and the camera lens needs to have a long focal length. A DJI Mavic 2 Zoom (Figure 5.2) is chosen to perform this inspection. The powerful omnidirectional obstacle sensing system and the 2x optical zoom function make it possible to shoot cracks from a safe distance of approximately 1.5-3 m. The specifications of this drone model and the camera are listed in Table 5.1.

Figure 5.2: DJI Mavic 2 Zoom

Table 5.1: Specification of the drone and camera

Drone model | DJI Mavic 2 Zoom
Takeoff weight | 905 g
Max speed | 20 m/s
Max wind resistance | 38 kph
Lens | 24-48 mm (35 mm format equivalent)
Camera sensor | 1/2.3" CMOS
Image size | 4000×3000
Image format | .JPG

The inspected objects are a wall in the structural lab and a damaged concrete beam specimen (Figure 5.3). The wall is inside a building, so the drone has to be operated in an indoor environment; a minimum distance of 1.5 m from the wall can be reached. The left photo in Figure 5.3 also shows the operation of the drone.

Figure 5.3: The damaged wall and concrete specimens

The wall was struck by a truck and multiple cracks can be seen on the surface. It is a brick wall, but the rough surface has features similar to a concrete wall. Figure 5.4 shows the photos taken by the drone and the crack detection results of the algorithm. The cracks have been detected very well, but there are also false positive detections from both the classification and the segmentation procedure. These false identifications mostly occur at the edges of the bricks, most likely because all the training data for both CNNs was collected from concrete structures, which makes the algorithm relatively vulnerable in the brick-wall situation.

Figure 5.4: Images of the damaged wall and crack detection results

The specimen was taken from Gothenburg harbor for research purposes. The specimen is severely damaged and very wide cracks can be found at mid-span. Marks indicating the location of the cracks have been made on the surface. As can be seen from Figure 5.5, these marks cause some false positive predictions, while on the same photo the algorithm still manages to recognize the cracks. A very wide crack appears in the third photo and the algorithm fails to detect it. The reason for this failure is likely that the width of the crack in the image exceeds the sub-image size.

Figure 5.5: Images of the damaged specimen and crack detection results

6 Conclusion and prospect

6.1 Conclusion

This project proposes an image-based crack detection algorithm that achieves automatic crack extraction without human intervention. Two CNN models are used for classification and segmentation respectively. The whole prediction process, including both classification and segmentation, takes only 86 seconds in total. The visual performance on the full-scale testing images intuitively indicates the high accuracy of both CNNs. This project has shown the practicability of applying deep learning methods in the structural health monitoring field.

The biggest challenge in this project is the variety of the image database. The database is very raw and realistic with respect to how close-up photos taken by a drone look, and this large image variation will without doubt debase the final prediction accuracy to some degree. In the tough process of searching for the "best CNN model", many experiments were carried out. The conclusion is that transfer learning can give a quick first indication of how well a designed deep CNN will work on a cross-domain data-set.
A more efficient way of finding a suitable network architecture is therefore to first apply the transfer learning method, and then to seek the appropriate parameters customized for each specific use case by carrying out different experiments. For example, in this project, the alpha value representing the number of channels was modified, and various loss functions were tried out; the series of loss functions rooted in weighted cross-entropy showed superiority when the data-set is highly imbalanced. Finally, small adjustments, such as the trade-off between computational power and prediction performance, can be made based on the priorities of different environments.

Despite the good results, both numerically and visually, the training effort should still be counted. In this project, both CNNs have a long training process if one pursues real-time crack detection. Moreover, a network trained on crack images does not necessarily reach high accuracy on testing images that differ greatly from the original training images, which suggests the network should be trained continuously when it goes into industrial use. A limitation of this automated crack detection algorithm is that the classification step is decisive for the overall performance, which causes vulnerability once the classifier produces false positives or false negatives. One optimization direction is to reduce the probability threshold for positive admission to decrease the false negatives, and then to develop a dedicated algorithm based on crack geometry features after the segmentation step to filter out the increased false positives.

6.2 Prospect

As mentioned in the introduction, the proposed algorithm serves as a part of the Digital Twin concept. To achieve the target of SHM based on digital representations of real structures, the proposed crack detection algorithm should be integrated properly into the Digital Twin work flow.

Data acquisition: The proposed algorithm takes digital images as input. Drones with high-resolution cameras provide a highly efficient way to perform visual inspection and image data acquisition. Different countries have different regulations about the distance to keep when flying a drone around infrastructure, but in all cases the operator needs to find the cracked region from a distance. With respect to this demand, object detection techniques might be the solution. Much research has built object detection methods [35] into drones and located cracked areas successfully. By flying the drone to the target location, the drone can bring back close-up crack pictures.

Figure 6.1: An example of applying an object detection method to locate the cracked area of a concrete bridge

The finite element model: In order to map the cracked sub-images onto the bridge model, 3D point cloud techniques might be used. This means the images should contain their own location information, or relative location information, in the first place. To successfully utilize the crack detection results, this mapping process needs to be accurate as well.

Bibliography

[1] Yuequan Bao, Zhicheng Chen, Shiyin Wei, Yang Xu, Zhiyi Tang, and Hui Li. The state of the art of data science and engineering in structural health monitoring. Engineering, 5(2):234-242, 2019.

[2] Elisa Negri, Luca Fumagalli, and Marco Macchi. A review of the roles of digital twin in CPS-based production systems. Procedia Manufacturing, 11:939-948, 2017.
27th International Conference on Flexible Automation and Intelligent Manufacturing, FAIM2017, 27-30 June 2017, Modena, Italy.

[3] Yi-Chen Zhu, David Wagg, Elizabeth Cross, and Robert Barthorpe. Real-time digital twin updating strategy based on structural health monitoring systems. In Zhu Mao, editor, Model Validation and Uncertainty Quantification, Volume 3, pages 55-64, Cham, 2020. Springer International Publishing.

[4] Abdulmotaleb El Saddik. Digital twins: The convergence of multimedia technologies. IEEE MultiMedia, 25:87-92, 2018.

[5] Sandeep Sony, Shea Laventure, and Ayan Sadhu. A literature review of next-generation smart sensing technology in structural health monitoring. Structural Control and Health Monitoring, 26(3):e2321, 2019.

[6] Sandeep Sony, Kyle Dunphy, Ayan Sadhu, and Miriam Capretz. A systematic review of convolutional neural network-based structural condition assessment techniques. Engineering Structures, 226:111347, 2021.

[7] Lei Zhang, Fan Yang, Yimin Daniel Zhang, and Ying Julie Zhu. Road crack detection using deep convolutional neural network. In 2016 IEEE International Conference on Image Processing (ICIP), pages 3708-3712. IEEE, 2016.

[8] Byunghyun Kim and Soojin Cho. Automated vision-based detection of cracks on concrete surfaces using a deep learning technique. Sensors, 18(10), 2018.

[9] Fangzheng Lin, Jiesheng Yang, Jiangpeng Shu, and Raimar J. Scherer. Crack semantic segmentation using the U-net with full attention strategy. arXiv preprint arXiv:2104.14586, 2021.

[10] Lingxin Zhang, Junkai Shen, and Baijie Zhu. A research on an improved U-net-based concrete crack detection algorithm. Structural Health Monitoring, page 1475921720940068, 2020.

[11] Ikhlas Abdel-Qader, Osama Abudayyeh, and Michael E. Kelly. Analysis of edge-detection techniques for crack identification in bridges. Journal of Computing in Civil Engineering, 17(4):255-263, 2003.

[12] Tomoyuki Yamaguchi, Shingo Nakamura, Ryo Saegusa, and Shuji Hashimoto. Image-based crack detection for real concrete surfaces. IEEJ Transactions on Electrical and Electronic Engineering, 3(1):128-135, 2008.

[13] Takafumi Nishikawa, Junji Yoshida, Toshiyuki Sugiyama, and Yozo Fujino. Concrete crack detection by multiple sequential image filtering. Computer-Aided Civil and Infrastructure Engineering, 27(1):29-47, 2012.

[14] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436-444, 2015.

[15] Arun Mohan and Sumathi Poobal. Crack detection using image processing: A critical review and analysis. Alexandria Engineering Journal, 57(2):787-798, 2018.

[16] David H. Hubel and Torsten N. Wiesel. Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. The Journal of Physiology, 160(1):106-154, 1962.

[17] Kunihiko Fukushima and Sei Miyake. Neocognitron: A new algorithm for pattern recognition tolerant of deformations and shifts in position. Pattern Recognition, 15(6):455-469, 1982.

[18] Yann LeCun, Bernhard E. Boser, John S. Denker, Donnie Henderson, Richard E. Howard, Wayne E. Hubbard, and Lawrence D. Jackel. Handwritten digit recognition with a back-propagation network. In Advances in Neural Information Processing Systems, pages 396-404, 1990.

[19] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249-256. JMLR Workshop and Conference Proceedings, 2010.
[20] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6):84-90, May 2017.

[21] Andrej Karpathy, Fei-Fei Li, and Justin Johnson. CS231n: Convolutional neural networks for visual recognition. URL http://cs231n.github.io, 2017.

[22] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

[23] Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. How does batch normalization help optimization? arXiv preprint arXiv:1805.11604, 2018.

[24] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289, 2015.

[25] Young-Jin Cha, Wooram Choi, and Oral Büyüköztürk. Deep learning-based crack damage detection using convolutional neural networks. Computer-Aided Civil and Infrastructure Engineering, 32(5):361-378, 2017.

[26] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.

[27] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[28] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.

[29] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Nassir Navab, Joachim Hornegger, William M. Wells, and Alejandro F. Frangi, editors, Medical Image Computing and Computer-Assisted Intervention - MICCAI 2015, pages 234-241, Cham, 2015. Springer International Publishing.

[30] Cao Vu Dung et al. Autonomous concrete crack detection using deep fully convolutional neural network. Automation in Construction, 99:52-58, 2019.

[31] Lee R. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297-302, 1945.

[32] Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In 2016 Fourth International Conference on 3D Vision (3DV), pages 565-571, 2016.

[33] Seyed Raein Hashemi, Seyed Sadegh Mohseni Salehi, Deniz Erdogmus, Sanjay P. Prabhu, Simon K. Warfield, and Ali Gholipour. Tversky as a loss function for highly unbalanced image segmentation using 3D fully convolutional deep networks. arXiv preprint arXiv:1803.11078, 2018.

[34] Nabila Abraham and Naimul Mefraz Khan. A novel focal Tversky loss function with improved attention U-net for lesion segmentation. In 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019), pages 683-687, 2019.

[35] Chaobo Zhang, Chih-chen Chang, and Maziar Jamshidi. Concrete bridge surface damage detection using a single-stage detector. Computer-Aided Civil and Infrastructure Engineering, 35(4):389-409, 2020.