Manually Mapping Model Elements onto
the Modeled Code by Analyzing GitHub
Data

Master’s thesis in Computer science and engineering

WENLI ZHANG

Department of Computer Science and Engineering
CHALMERS UNIVERSITY OF TECHNOLOGY
UNIVERSITY OF GOTHENBURG
Gothenburg, Sweden 2023


© WENLI ZHANG, 2023.

Supervisor: Regina Hebig, Computer Science and Engineering
Examiner: Christian Berger, Computer Science and Engineering

Master’s Thesis 2023
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Typeset in LaTeX

Abstract
Context: Class diagrams are one of the most popular UML models and are
frequently used in the early stages of software development. Their advantage
is that they reflect design decisions and the system’s implementation
structure, which maintainers can use to understand the system. Yet, as the
code evolves, class diagrams that are not updated cause the code
implementation to deviate from the class diagram design. One concern is that
such a divergent class diagram no longer helps maintainers in the same way
during the maintenance stage. As a solution, reverse engineering
methods/tools can reverse code into class diagrams. Yet another concern
arises: in most cases, the reverse-engineered class diagrams are not abstract
and contain extensive information that burdens the understanding of the
system’s implementation structure. This is because the existing reverse
engineering methods/tools are imperfect: they do not manage to imitate the
human ability to abstract relevant information from the source code.
Surprisingly, existing studies on the characteristics of manual abstraction
are based on the opinions and experiences of participants but do not study
actual cases of models and source code. Also, the methods/technologies used
for checking the similarities and differences between models and source code
are purely structural; they do not take the semantics of the model elements
into account when mapping classes between models and code, although semantics
is closely related to how abstractions are created. Therefore, a systematic
manual study of the characteristics of manual abstraction is required.

Aim: To fill this gap, this thesis aimed to study the characteristics of the
differences between the class diagram design and the code implementation by
manually creating mappings between the class diagram elements/constituents
and the code constructs. Such manual studies can precisely capture the
differences between the class diagram design and the source code
implementation and allow us to investigate the causes of these differences.

Method: We conducted five case studies. The five subjects are open-source
Java projects collected from GitHub, semi-randomly selected from the
Lindholmen dataset [1].

Results: For the differences between the class diagram design and the code
implementation, we identified three causes: various levels of manual
abstraction created in class diagrams, deviations of the code implementation
from the class diagram design, and common changes between the class diagram
elements/constituents and the code constructs. We contribute a sorted list of
cases corresponding to these three causes.

Keywords: UML, Models in Open Source Systems, Reverse Engineering, Deviations
between Code and Design, Manual Abstraction in Modeling

Acknowledgements
Approaching the end of my master’s degree, I would like to take this
opportunity to express my gratitude to all my lecturers, former colleagues,
friends, and family who helped and supported me through these two and a half
years.

To my thesis supervisor, Dr. Regina Hebig: your all-around support and trust
in my thesis work have been like the warmest sunshine in a dark Swedish
winter. Your knowledge, inner drive, and kindness in helping me in all
respects (not limited to academics) have deeply shaped my attitude toward my
future studies and life: being loyal and optimistic. I particularly remember
the days when I was stuck on the definition of terminologies; you guided me
patiently, even hand in hand, and taught and inspired me to find the best
solution to the puzzle I was in at that time. The valuable and interesting
learning experiences from your own student days that you shared with me will
be an invaluable asset in my life.

To my examiner, Dr. Christian Berger, for your valuable input on my thesis
proposal and half-time presentation, which helped me think through the
methodology of this thesis more carefully. To Dr. Richard Torkar, for whom I
served as a teaching assistant for about six months: your enthusiasm for
research and willingness to help students have encouraged me to want to
become a person like you. To all of my lecturers, for instilling in me
software engineering knowledge and cultivating my problem-solving and
independent thinking abilities.

To my former company mentor, Dr. Jianlin Shi, and my director, Yujie Xue, with
whom I worked for half a year in 2021. Your full support and care greatly boosted
my confidence to pursue what I am into (this thesis work) at the campus.

To my friends, who cheered me up and always believed in me: you are all like
true “family members”. We shared our interests and hobbies, patiently taught
each other some life tips, and explored new things together. This made up a
major part of my life and helped me balance learning and living. To the most
important people in my life, my parents, brother, and sister, who have always
encouraged me to be brave and supported me in pursuing what I have been
interested in since I was a child.

My great interest in modeling and software architecture supported me in finalizing
this thesis. I am grateful to Dr. Regina Hebig again for providing me with this thesis
topic, which helped me to step into my area of interest and establish a framework
for learning which can be further extended in the future.

Without all of you, this thesis would not have been possible.

Wenli Zhang, Gothenburg, February 2023

Contents

List of Figures

List of Tables

List of Acronyms

1 Introduction
1.1 Background
1.2 Challenges
1.3 Goal and Motivation
1.4 Case Study Subjects
1.5 Definition of Terminologies
1.5.1 Examples of Mappings between mcAM Concepts and voSC
1.5.2 Ideal Selection of One cSC among Multiple cSCs of a mcAM
1.6 Research Questions
1.7 Contributions
1.8 Structure of the Paper

2 Theory and Related Work
2.1 Theory
2.1.1 Related Java Knowledge
2.1.1.1 Java SE7 Specifications
2.1.1.2 Object-Oriented (OO) Paradigm
2.1.2 Unified Modeling Language (UML)
2.1.2.1 Graphical Notation for Classifiers
2.1.2.2 Conversions of BNF
2.1.2.3 Textual Notation for Attributes
2.1.2.4 Textual Notation for Operations
2.1.2.5 Visibility
2.1.2.6 Multiplicity
2.1.2.7 Graphical Notation for Relationships
2.2 Related Work
2.2.1 Reverse Engineering
2.2.1.1 Manual Abstraction Created over Code
2.2.1.2 Solutions and Attempts to Provide Abstraction for Reverse-Engineered Class Diagrams
2.2.1.3 Consistency Check(s) between Code and Design

2.2.2 Models in Software Development Practice
2.2.2.1 Usage of Models
2.2.2.2 Dataset/Database of Models

3 Methodology
3.1 Open Source Repositories Access
3.2 Data Selection
3.2.1 Definition of Terminologies
3.2.2 Ideal Selection of One mcAM from a Project
3.2.3 Ideal Selection of One cSC among Multiple cSCs of that mcAM
3.2.3.1 Time and Resource Consumption
3.3 Automatic Reverse Engineering Tool Selection
3.4 Spreadsheet Comparison Template Design
3.4.1 Turning mcAM and cSC into Pictures for Differentiating

4 Results
4.1 Information on the Project Studied
4.2 Suspected Cases
4.3 Cases of the Differences Caused by MA and disAGTs
4.3.1 Cases of MA - Classes
4.3.1.1 Case MC-1 - Hierarchical inheritance structure not created in the mcAM is added to the cSC
4.3.1.2 Case MC-2 - One class created in the mcAM is divided into more than one class in the cSC (related to the specific design patterns)
4.3.2 Cases of MA - Attributes
4.3.2.1 Case MA-1 - Additional attributes not in the mcAM are added to the cSC
4.3.2.2 Case MA-2∗ - An attribute of classifier A whose type can indicate an aggregation or a composition between classifiers A and B from the cSC, is modeled out by the naming of the association between them in the mcAM
4.3.2.3 Case MA-3∗ - The additional attribute type not in the mcAM is added to the cSC
4.3.2.4 Case MA-4 - The additional default value not in the mcAM is added to the cSC
4.3.2.5 Case MA-5 - One or more common attributes in one or more subclasses in the mcAM are upshifted to the hierarchical structure in the cSC, i.e., the superclass inherited by these subclasses
4.3.2.6 Case MA-6 - One attribute created from the mcAM is divided into more than one attribute in the cSC
4.3.3 Cases of MA - Operations
4.3.3.1 Case MO-1 - Additional operations not in the mcAM are added to the cSC (three subcases)

4.3.3.2 Case MO-2 - Additional parameter names not in the mcAM are added to the cSC
4.3.3.3 Case MO-3 - Additional parameter types not in the mcAM are added to the cSC
4.3.3.4 Case MO-4 - Additional return types not in the mcAM are added to the cSC
4.3.3.5 Case MO-5∗ - Multiplicity (referring to the collection-related interfaces) specified for the return type without its corresponding implementation interfaces/classes specified in the mcAM, yet with them specified in the cSC
4.3.3.6 Case MO-6∗ - The default parameters of the operation (in Android) in the cSC are omitted in the mcAM
4.3.3.7 Case MO-7 - One operation created in the mcAM is divided into multiple operations in the cSC
4.3.4 Cases of MA - Relationships
4.3.4.1 Case MR-1 - Additional relationships not in the mcAM are added to the cSC (two subcases with their six and two corresponding concluded causes, respectively)
4.3.4.2 Case MR-2 - An aggregation between A (whole) and B (part) from the cSC is modeled as an association between A (origin) and B (target) in the mcAM (two causes)
4.3.4.3 Case MR-3 - A composition between A (whole) and B (part) from the cSC is modeled as an association between A (origin) and B (target) in the mcAM (one cause)
4.3.5 Cases of disAGTs - Classes
4.3.5.1 Case DC-1 - Hierarchical inheritance structures in the mcAM are removed in the cSC
4.3.6 Cases of disAGTs - Attributes
4.3.6.1 Case DA-1 - Attributes in the mcAM are removed in the cSC
4.3.6.2 Case DA-2∗ - Attributes from the mcAM are replaced with additional attributes (that are not in the mcAM added to the cSC) in the cSC
4.3.6.3 Case DA-3∗ - The attribute type in the mcAM and cSC is different (two causes)
4.3.6.4 Case DA-4∗ - Converting variables from the mcAM to constant variables in the cSC
4.3.6.5 Case DA-5∗ - The default value not in the mcAM is added to the cSC
4.3.7 Cases of disAGTs - Operations
4.3.7.1 Case DO-1 - Operations in the mcAM are removed in the cSC

4.3.7.2 Case DO-2 - The parameter names in the mcAM are removed in the cSC
4.3.7.3 Case DO-3 - The parameter types in the mcAM are removed in the cSC
4.3.7.4 Case DO-4∗ - The parameter type in the mcAM and cSC is different (one cause)
4.3.7.5 Case DO-5 - The return types in the mcAM are removed and as void in the cSC
4.3.7.6 Case DO-6∗ - One or more operations from classifier A in the mcAM are moved to classifier B in the cSC (related to the particular architectural patterns)
4.3.8 Cases of disAGTs - Relationships
4.3.8.1 Case DR-1 - Relationships between A and B from the mcAM are removed in the cSC (one cause)
4.3.8.2 Case DR-2 - A composition between A (whole) and B (part) from the mcAM is changed into an aggregation in the cSC (one cause)
4.3.8.3 Case DR-3∗ - Relationships between A and B from the mcAM are replaced with new relationships in the cSC (two causes)
4.4 Cases of the Differences Caused by CC
4.4.1 Cases of CC - Classes
4.4.1.1 Case CC-1 - Naming of classes in the mcAM and cSC is different (three causes)
4.4.2 Cases of CC - Attributes
4.4.2.1 Case CA-1 - Naming of attributes in the mcAM and cSC is different (two causes)
4.4.2.2 Case CA-2 - Naming of attribute types is different (two causes)
4.4.3 Cases of CC - Operations
4.4.3.1 Case CO-1 - Naming of operations in the mcAM and cSC is different (four causes)
4.4.3.2 Case CO-2 - Naming of parameters in the mcAM and cSC is different (two causes)
4.4.3.3 Case CO-3 - Naming of parameter types in the mcAM and cSC is different (one cause)
4.4.3.4 Case CO-4 - Naming of return types in the mcAM and cSC is different (one cause)
4.4.4 Ratios of Cases with Related Projects
4.4.5 Typical/Common Cases with Related Projects
4.4.6 Aggregated Results of the Cases for the Project

5 Discussion
5.1 Reflection on Data Selection
5.2 Limitations of Reverse Engineering Tools
5.3 Advantages of Manual Mappings

5.4 Guideline Enlightened by Cases
5.5 Related Technologies for Healing the Cases
5.6 Threats to Validity
5.6.1 Threats to Construction Validity
5.6.2 Threats to External Validity
5.6.3 Threats to Conclusion Validity

6 Conclusion and Future Work
6.1 Conclusion
6.2 Future Work
6.2.1 Covering Additional OOP Projects
6.2.2 Quantitative Analysis

Bibliography

A Corresponding mcAMs in the Five Projects Studied

List of Figures

1.1 Three challenges involved in the staged SDLC are likely to cause the implementation of the source code to deviate from the design of the class diagram; thereby, differences between the class diagram and the source code are introduced.
1.2 In the mcAM, for a concept, e.g., Dog, a second concept Pet is described by the superclass Pet of the class Dog that describes the concept Dog.
1.3 For the superclass Pet of the class Dog that describes a second concept Pet of a concept Dog in the mcAM, the class Pet in this voSC is considered to be mapped to that superclass Pet of the class Dog in the mcAM.
1.4 A class SinglePlayModel describes a concept SinglePlayerModel in the mcAM, yet the class name SinglePlayModel contains a misspelling.
1.5 The class SinglePlayerModel in this voSC has attributes and operations highly similar to those of the class SinglePlayModel in the mcAM; thus, the class SinglePlayModel is considered to be mapped to the class SinglePlayerModel in this voSC.
1.6 In the mcAM, a concept, i.e., decorators, is described by a class ItemDecorator, which might imply that the decorator design pattern would be applied in the voSC.
1.7 In this voSC, the concept decorators in the mcAM is described by the superclass Slot and the subclasses LeftUpgradeSlot and RightUpgradeSlot derived from the superclass Slot. These three classes are extensions of the interface Icon embedded in the Java library. This interface Icon is invoked in a class related to View, which is responsible for communicating with the subclass Upgrade derived from the superclass Item in the mcAM.
1.8 In the mcAM, for a concept, e.g., Dog, a second concept Pet is described by the superclass Pet of the class Dog that describes the concept Dog.
1.9 This voSC only includes one class Pet that can be mapped to the superclass Pet of a class, e.g., Dog, in the mcAM.
1.10 A concept, i.e., InputFunction, is described by a class InputFunction in the mcAM.
1.11 The concept InputFunction in the mcAM is described by a superclass InputFunction that is not in the mcAM but is added to the voSC.

1.12 In the mcAM, eight concepts are described by eight classes, i.e., BaseUI, BaseController, IncomeRegisterUI, RegisterIncomeController, Income, IncomeTyperRepository, CheckingAccount, and IncomeRepository, respectively.
1.13 A cSC of the mcAM covers all eight concepts in the mcAM.
1.14 Compared with the cSC illustrated in Figure 1.13, this cSC of the mcAM covers four more operations.

2.1 Three abstraction levels of graphical class notation: suppressed (in the top left corner), analysis (on the right), and implementation (in the bottom left corner) [21, p. 50]. Compared with the suppressed level, which shows only the name of a class, the other two levels present a more comprehensive layout of a class with more or less textual notation for attributes and operations embedded.
2.2 Specification of Backus-Naur Form (BNF) conversions [21, pp. 16–17].
2.3 Textual notation for attributes [21, pp. 129–130].
2.4 Textual notation for operations [21, pp. 107–108].
2.5 Syntax for a multiplicity string [21, p. 98].
2.6 Graphical notation for seven types of relationships.
2.7 Four abstraction levels specified for four relationships: dependency, association, aggregation, and composition, respectively.

3.1 The overall process of the methodology.
3.2 The selection process of one ideal cSC among multiple cSCs of the selected mcAM.
3.3 Record information about the project.
3.4 Record information of the mcAM.
3.5 Record information of the cSC.
3.6 Record the differences between the mcAM and the cSC of that mcAM.
3.7 Record additional notes.
3.8 For the class CheckingAccount, the attributes, e.g., income: Income, are modeled in the mcAM.
3.9 For the class CheckingAccount, the attribute income: Income from the mcAM is replaced with the entirely new attribute incomeRepo: IncomeRepository in the cSC.
3.10 We recorded the difference, i.e., attributes from the mcAM are replaced with new attributes that are not in the mcAM yet are added to the cSC, and classified it as disAGTs in our comparison template.

4.1 The subclasses, e.g., Max, implemented in the cSC are hidden in the mcAM. Also, the superclass InputFunction inherited by these subclasses in the cSC is modeled as a class in the mcAM.
4.2 A hierarchical inheritance structure with the corresponding two additional subclasses not in the mcAM is added to the cSC.
4.3 In the mcAM, a concept, i.e., decorators, is described by a class ItemDecorator, which might imply that the decorator design pattern will be applied in the cSC.

4.4 In the cSC, the concept decorators in the mcAM is described by the superclass Slot and the subclasses LeftUpgradeSlot and RightUpgradeSlot derived from the superclass Slot. These three classes are extensions of the interface Icon embedded in the Java library. This interface Icon is invoked within a View class, which is responsible for communicating with the subclass Upgrade derived from the superclass Item in the mcAM.
4.5 The naming of the association between the classes LaborBilling and PhaseLabor is specified with the attribute name lb in the cSC.
4.6 In the cSC, the attribute name lb indeed exists in the class PhaseLabor.
4.7 In the mcAM, no default value is assigned to the attributes, e.g., email.
4.8 A default value “ ” not in the mcAM is added to the attribute email.
4.9 In the mcAM, the subclasses Food and Upgrade have a common attribute image.
4.10 The common attribute image of the subclasses Food and Upgrade in the mcAM is upshifted to their superclass Item in the cSC.
4.11 The three subcases of this case with their corresponding causes.
4.12 Case MR-1 with the corresponding subcases and causes.
4.13 Cases MR-2 and MR-3, with the corresponding causes.
4.14 In the mcAM, no association is modeled between the classes LaborBilling and Pattern.
4.15 In the cSC, an operation getId() of LaborBilling is invoked in an operation getPhaseLabor(LaborBilling, Phase): PhaseLabor of Pattern. Yet, the parameter types of the latter operation, e.g., LaborBilling, are not modeled in the mcAM (as shown in Figure 4.14).
4.16 In the mcAM, no relationship is modeled between the classes BaseController and CheckingAccount. In addition, the return type CheckingAccount modeled in BaseController in the mcAM can partially imply that an association exists in the cSC.
4.17 In the cSC, an instance of CheckingAccount is created within the operation buildCheckingAccount(): CheckingAccount of BaseController and is then returned by this operation.
4.18 In the mcAM, no relationship is modeled between the classes ItemVisitor and Food. Yet, an instance of Food, i.e., is specified as a parameter type of the operation named visit of ItemVisitor in the mcAM.
4.19 In the cSC, the operation getPrice(): int that Food inherits from the superclass Item is indeed invoked within visit(Food): void of ItemVisitor.
4.20 In the mcAM, no relationship is modeled between the classes SinglePlayer and TitlePage.
4.21 In the cSC, an instance of TitlePage is created within an operation pausedMainMenu(View view): void of SinglePlayer.
4.22 In the mcAM, no relationship is modeled between the classes UserBuilder and Control.

4.23 In the cSC, an instance named userBuilder of UserBuilder is created within the operation newUser(int, String, String, String) of Control. In addition, an operation of userBuilder, e.g., setEmail(): String, is further invoked within the same operation of Control (in the mcAM)/RaiseMeUp (in the cSC).
4.24 In the mcAM, no relationship is modeled between the classes Food and Control.
4.25 In the cSC, an instance of Food (named f) is specified as a parameter type of an additional operation removeFood(Food): boolean of Control/RaiseMeUp. This instance of Food is used in another operation delFood(f): boolean of another class Dao. In addition, in Dao, an operation of Food, e.g., getName(), is further invoked within that delFood(Food): boolean.
4.26 In the mcAM, no relationship is modeled between the classes Neuron and NeuralNetwork.
4.27 In the cSC, an additional attribute inputNeurons: NeurophArrayList<Neuron> not in the mcAM is added to the class NeuralNetwork. The operation size() of Neuron is further invoked within the operation getInputsCount(): int of NeuralNetwork. In addition, the instances of Neuron are contained not only by NeuralNetwork but also by Layer in the cSC.
4.28 In the mcAM, no relationship is modeled between the classes Job and Dao.
4.29 In the cSC, an additional attribute jobs: Map<Integer, Food> not in the mcAM is added to the class DAO, and it is further created in the body of Dao. An operation put() (an API) is further invoked in the operation getJob(): Map<Integer, Job> of Dao with the instance jobs. However, the instances of Job are contained not only by DAO but also by another class Pet in the cSC.
4.30 The aggregation between Employee and Schedule from the cSC is modeled as an association in the mcAM. In addition, the naming of this association in the mcAM is specified by an attribute name (whose related type is specified by the instances of Schedule) of Employee from the cSC.
4.31 In the cSC, an attribute type ArrayList<Schedule> contained in the class Employee means that a group of instances of the class Schedule is declared in Employee. These instances of Schedule are further created within the operation Employee(Integer, String, String) of Employee in the cSC. However, the instances of the class Schedule are contained not only by the instances of Employee but also by the instances of another class MainClass in the cSC.
4.32 An aggregation between Income and IncomeRepository from the cSC is partially indicated by the attribute type (instances of Income) of IncomeRepository, i.e., List<Income>, in the mcAM.

4.33 In the cSC, the type of the attribute named listIncome is specified by
a collection of instances of Income, i.e., ArrayList<Income> in In-
comeRepository . This collection of instances of Income is further
created within an operation IncomeRepository() of IncomeReposi-
tory. Plus, the instances of Income are invoked within an opera-
tion save(Income): void of IncomeRepository in the cSC. Yet, the
instances of Income are contained by not only IncomeRepository but
also another class RegisterIncomeController in the cSC. . . . . . . . . 82

4.34 A composition between the classes SinglePlayer and SinglePlayModel
from the cSC is modeled as an association in the mcAM. . . . . . . . 84

4.35 In the cSC, an instance model of the class SinglePlayerModel that
is modeled out as an attribute type in the class SinglePlayer in the
mcAM is indeed declared as an attribute type of the class Single-
Player in the cSC. This created instance model of SinglePlayerModel
is further invoked within an operation of SinglePlayerModel, e.g., on-
KeyDown(int, KeyEvent): boolean. Plus, in the cSC, this instance
model is exclusive to the corresponding instances of SinglePlayer. . . 84

4.36 A hierarchical inheritance structure in the mcAM. . . . . . . . . . . . 85
4.37 The hierarchical inheritance structure in the mcAM is removed in the

cSC. Solely the superclass Pet in the mcAM is remained yet changed
into a class in the cSC. . . . . . . . . . . . . . . . . . . . . . . . . . . 85

4.38 For the class CheckingAccount, the attributes income: Income and
amount: BigDecimal are modeled in the mcAM. . . . . . . . . . . . . 86

4.39 For the class CheckingAccount, all its attributes in the mcAM are re-
placed with fully new attributes incomeRepo: IncomeRepository and
expenseRepo: ExpenseRepository in the cSC. . . . . . . . . . . . . . . 86

4.40 Two causes for case DA-3∗. . . . . . . . . . . . . . . . . . . . . . . . . 87

4.41 In the mcAM, for the attribute named owner, User (a non-primitive
data type) is specified. Also, for its setter operation named
setOwner(), a parameter type User is specified. . . . . . . . . . . . . 90

4.42 In the cSC, the type of the attribute named owner, specified as User
in the mcAM, is changed to int. Accordingly, the parameter type of
its setter operation named setOwner() is changed from User to
int. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

4.43 In the mcAM, for the class DAO, operations such as listUser(): Map
are created. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

4.44 In the cSC, the operation, e.g., listUser(): Map created in the class
DAO from the mcAM is moved to another class RaiseMeUp/Controll
as listUsers(): Map<Integer, User>. Then it is used for getting a list
of user data from the Model-related class Dao in the cSC. . . . . . . 92

4.45 Cases DR-1, DR-2, and DR-3∗, with the corresponding causes. . . . . 93

4.46 In the mcAM, a composition is created between the classes Pet and
User. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94


4.47 In the cSC, the attribute type whose type is specified by an instance
of User from the mcAM is converted to a primitive data type int.
This leads to the corresponding composition between Pet and User
from the mcAM being removed in the cSC. . . . . . . . . . . . . . . . 94

4.48 In the mcAM, a composition is modeled between the classes Layer
and Neuron. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

4.49 In the cSC, the instances of Neuron are contained by not only Layer
but also NeuralNetWork. . . . . . . . . . . . . . . . . . . . . . . . . . 95

4.50 In the mcAM, an association is created between CheckingAccount and
IncomeRepository. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

4.51 For CheckingAccount, the attributes from the mcAM, e.g., income:
Income, are replaced with fully new attributes, e.g., incomeRepo: In-
comeRepository in the cSC. The specified attribute type, an instance
of IncomeRepository, is created within an operation CheckingAccount()
of CheckingAccount in the cSC. This instance is further invoked within
another operation add(Income): void of CheckingAccount in the cSC.
Furthermore, the instances of IncomeRepository are contained not
only by CheckingAccount but also by another class ValuesCalculator
in the cSC. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

4.52 In the mcAM, an association is created between Pet (origin) and
PetObserver (target). . . . . . . . . . . . . . . . . . . . . . . . . . . 98

4.53 RaiseMeUp (target) is linked with Pet (origin); RaiseMeUp (origin)
is linked with PetObserver (target). Thus, an indirect association
between Pet (origin) and PetObserver (target) is built up. . . . . . . 98

4.54 Causes for the cases caused by CC. . . . . . . . . . . . . . . . . . . . 99

A.1 Selected mcAM included in project 1. . . . . . . . . . . . . . . . . . . I
A.2 Selected mcAM included in project 2. . . . . . . . . . . . . . . . . . . II
A.3 Selected mcAM included in project 3. . . . . . . . . . . . . . . . . . . III
A.4 Selected mcAM included in project 4. . . . . . . . . . . . . . . . . . . IV
A.5 Selected mcAM included in project 5. . . . . . . . . . . . . . . . . . . V



List of Tables

4.1 The project background (∼ = around). . . . . . . . . . . . . . . . . . 49
4.2 Suspected cases caused by MA and disAGTs potentially exist in other
voSC(s)/cSC(s) not selected by us previously. . . . . . . . . . . . . . 51
4.3 Cases for the differences caused by MA and disAGTs (∗ = own case,
<> = opposite). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.4 The corresponding example of the case MA-1. . . . . . . . . . . . . . 55
4.5 The corresponding example of the case MA-3∗. . . . . . . . . . . . . . 56
4.6 The corresponding example of the case MA-6. . . . . . . . . . . . . . 59
4.7 Corresponding examples of the causes for the three subcases of case
MO-1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.8 Six corresponding examples of the cases titled MO-2, MO-3, MO-4,
MO-5∗, MO-6∗, and MO-7. . . . . . . . . . . . . . . . . . . . . . . . . 64
4.9 The corresponding example of the case DA-1. . . . . . . . . . . . . . 86
4.10 Two corresponding examples of the causes for case DA-3∗. . . . . . . 87
4.11 The corresponding example of the cases DA-4∗ and DA-5∗ . . . . . . . 88
4.12 Four corresponding examples of the cases DO-1, DO-2, DO-3 and DO-5. 91
4.13 Seven cases of the differences caused by CC. . . . . . . . . . . . . . . 98
4.14 Three corresponding examples of the causes for case CC-1. . . . . . . 101
4.15 Two corresponding examples of the causes for case CA-1. . . . . . . . 101
4.16 Two corresponding examples of the causes for case CA-2. . . . . . . . 102
4.17 Four corresponding examples of the causes for case CO-1. . . . . . . . 104
4.18 Two corresponding examples of the causes for case CO-2. . . . . . . . 105
4.19 The corresponding example of the cause for case CO-3. . . . . . . . . 105
4.20 Two corresponding examples of the cause for case CO-4 (∗ = particular
interest). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
4.21 Ratios of the cases with involved projects (∗ = own case). . . . . . . . 107
4.22 Typical/common cases with related projects. . . . . . . . . . . . . . . 108
4.23 Project 1 - Respective differentiated cases of MA, disAGTs, and CC
(∗ = own case). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
4.24 Project 2 - Respective differentiated cases of MA, disAGTs, and CC
(∗ = own case). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
4.25 Project 3 - Respective differentiated cases of MA, disAGTs, and CC
(∗ = own case). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
4.26 Project 4 - Respective differentiated cases of MA, disAGTs, and CC
(∗ = own case). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
4.27 Project 5 - Respective differentiated cases of MA, disAGTs, and CC. . 110



List of Acronyms

Uncommon acronyms
mcAMs Manually created architectural models
voSC Version(s) of source code
cSC Conformable source code
MA Manual abstraction
disAGTs Disagreements
CC Common changes

General acronyms
SDLC Software Development Life Cycle
OOP Object-Oriented Programming
UML Unified Modeling Language
FOSS Free/Open Source Software
API Application Programming Interface
EA Enterprise Architect
IDEA IntelliJ IDEA
AST Abstract Syntax Tree
MVC Model-View-Controller
MVVM Model-View-ViewModel
OCL Object Constraint Language
MDE Model-Driven Engineering



1
Introduction

1.1 Background
The software development life cycle (SDLC) is defined by Stoica et al. [2] as “an
environment that describes the activities performed in each stage of the software
development process.” In the early stages of the SDLC, the Unified Modeling
Language (UML) is widely used to model and visualize system artifacts [3]. Among
the various UML models, the class diagram is the most commonly used and most
important diagram in object-oriented system modeling [4]. In practice, class
diagrams help software engineers, both developers and maintainers, to understand
a system’s architecture, behavior, design choices, and implementation [5]. Thus,
it is easier for developers and maintainers to understand the structure of the
system by looking at the class diagram than by reading through the code in detail.

1.2 Challenges
Class diagrams model the information on the domain of interest in terms of objects
organized in classes and relationships between them [6]. They are intensively used in
the early stages of the SDLC to present the system’s structure. Maintainers benefit
from using class diagrams to understand the system’s structure and thus to locate
the places that need to be modified [7]. However, other authors have identified
three challenges that are likely to cause the implementation of the source code to
deviate from the design of the class diagram. The concern is that such divergent
class diagrams no longer help developers and maintainers understand the structure
of the system in the same way.

These three challenges affect the activities performed by software engineers in
different stages of the SDLC. Note that the stages of the SDLC vary depending on
the source [2, 8, 9]. Figure 1.1 illustrates five of the stages: analysis, design,
implementation, testing, and maintenance. This thesis focuses on the three stages
in which the three challenges arise: design, implementation, and maintenance
(the stages filled in blue in Figure 1.1). Architects and their teams, developers,
and maintainers are involved in each of these three stages. A detailed description of
the three challenges depicted in Figure 1.1 is as follows:

• Challenge 1 - Creating various levels of manual abstraction on the elements of
the class diagram during the design stage: Osman [p. 45, 10] proposed that
class diagrams with a low level of detail are used to show a high-level
abstraction of the structure of the system. However, little is known about which
level of abstraction the architects and their teams create during the design phase.

• Challenge 2 - Missing or unfollowed parts of the class diagram during the
implementation stage: Guéhéneuc [5] observed that class diagrams produced
during the design stage are often forgotten during the implementation stage,
usually under time pressure. Truong et al. [11] found that many created
designs are only partially followed during implementation. Thus, missing or
unfollowed parts of the class diagram will cause the implementation of the
source code to deviate from the design of the class diagram.

• Challenge 3 - Missing updates of the class diagram during the maintenance
stage as the code evolves: Osman et al. [12] reported that the frequency of
updating UML models is low, although a new feature introduced in a new
version/release of the system should result in an update of the class diagram.
We agree with the assertion by Osman et al. [13] that keeping diagrams up
to date with code evolution is often a challenge. Figure 1.1 illustrates that
code evolves over time from implementation to maintenance. However, if the
class diagram remains static (without updates) as the code evolves, it does not
reflect the new features introduced to the system.

Figure 1.1: Three challenges involved in the staged SDLC are likely to cause the
implementation of the source code to deviate from the design of the class diagram;
thereby, differences between the class diagram and the source code are introduced.


1.3 Goal and Motivation
As mentioned above, divergent class diagrams cannot be used in the same way
by software engineers to understand the system’s implementation structure. As
a solution, reverse engineering methods/tools can reverse code into class diagrams.
Yet, in most cases, the reverse-engineered class diagrams are not abstract and
contain extensive information, which burdens the software engineers’ understanding
of the system’s implementation structure. This is because the characteristics of
manual abstraction cannot yet be provided as input to reverse engineering
tools/methods. Thus, the tools/methods cannot imitate humans in abstracting only
the relevant information. These characteristics can only be obtained through
flexible manual studies, since humans can jointly interpret the semantics
conveyed by different model elements based on a full understanding of the relevant
code implementation. Thus, this thesis aims to manually discover the
characteristics of the manual abstraction created in the model elements.

The goal of this thesis is motivated by the following observation:

No actual case of models and source code has been studied in terms of
manual abstraction characteristics: As far as we know, the existing studies on
the characteristics of manual abstraction are based on the opinions and experiences
of the participants and do not study actual cases of models and source code. The
consistency checks of the differences and similarities between the design and code
are purely structural and do not take the semantics conveyed by model elements into
account. Yet, these semantics are key to studying the manual abstraction
characteristics. Different systems have their own implementation structures, and
the desired functionalities require code structures that are interrelated and
cooperate with each other, while the applied architecture and design patterns must
also be taken into account. These factors need to be considered jointly, which can
only be achieved by flexible manual studies of models and source code.

1.4 Case Study Subjects
To investigate the characteristics of the differences between the source code
and the class diagram, we conducted five case studies. This requires a dataset
that includes a set of projects with class diagrams and the corresponding source
code. However, such datasets are rare and difficult to access, since industrial
models and source code are often not accessible for research. To address
this issue, we decided to make use of models used in Free/Open Source Software
(FOSS) projects. Thus, we used the Lindholmen Dataset [1] created by Hebig et al.
for this thesis work.

The Lindholmen Dataset [1] holds 3 295 open-source projects from GitHub, which
together include 21 316 UML models. The model files are in two formats (images and
standard files). We first selected five projects written in the Java programming
language from that dataset as our study subjects. We then limited the selected
class diagrams of these projects to those in image format, and we refer to them as
manually created architectural models (mcAMs). The model elements we studied are
classes, attributes, operations, and relationships (i.e., dependencies, usages,
associations, aggregations, compositions, inheritances, and realizations).

1.5 Definition of Terminologies
We define the following terms, or adopt definitions given by others, to enable
the reader to understand the research questions (RQs) formulated in section 1.6.

As defined in the source [14], a commit is a snapshot of the changes made to the
staging area, which holds the files to be included in the next commit.

A version of source code (voSC) is the collection of source code
files found in the repository after a commit (and before the next commit).

Concepts are described by classes created in the mcAM. To be specific, a concept
can be described either by an abstract class (superclass) or by a non-abstract
(normal) class.

Note that regarding the definition of concepts, one can argue that concepts can be
described by both classes and relationships between these classes from the mcAM.
However, we argue that concepts are described only by classes from the mcAM.
This is because the differences in attributes and operations of the classes between
the mcAM and the source code would lead to differences in relationships. These
correlated differences are what we aim to study.

Map to a voSC refers to mapping one or more concepts described in the mcAM
to one or more classes of that voSC. Note that human judgments are involved in the
mapping since the naming of classes in the mcAM and source code might be different.

A voSC v is conformable to the mcAM if, for every concept a described in the
mcAM, one of the following holds: there is a Map of the concept a to the voSC v,
or there is a second concept b in the mcAM, described by the superclass of the
class that describes the concept a, and for this second concept b there is a Map
to that voSC v.

The latter case can be illustrated by the following example:

Example: This example is taken from the repository of one of our study projects,
i.e., ZooTypers [15] on GitHub. As seen in Figure 1.2, there are five concepts in
the mcAM: Pet, Dog, Cat, Fish, and Penguin. They are described by the superclass
Pet and the four subclasses Dog, Cat, Fish, and Penguin derived from that
superclass, respectively. Figure 1.3 illustrates a voSC that only includes a class
Pet, which can be mapped to the superclass Pet in the mcAM. This voSC is
conformable to the mcAM: for each subclass concept, e.g., Dog, there is a second
concept Pet described by the superclass Pet of the class Dog that describes the
concept Dog, and for this second concept Pet, there is a Map to that voSC.

Figure 1.2: In the mcAM, for the concept, e.g., Dog, there is a second concept
Pet described by the superclass Pet of the class Dog that describes the concept
Dog.

Figure 1.3: For the superclass Pet of the class Dog, which describes the second
concept Pet of the concept Dog in the mcAM, the class Pet in this voSC is
considered to be mapped to that superclass Pet of the class Dog in the mcAM.

Conformable source code (cSC) refers to a voSC that is conformable to the
mcAM.
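
The conformability definition above can be read as a simple predicate over the concepts of a mcAM. The following Java sketch is only an illustration under stated assumptions: the class ConformanceSketch, its method names, and the representation of concepts as strings are hypothetical and not part of the method used in this thesis. The Map itself is assumed to be the outcome of the manual judgment described above; the code does not compute it.

```java
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration of the conformability definition; not thesis tooling. */
final class ConformanceSketch {

    /**
     * Returns true if every mcAM concept has a Map to the voSC, either directly
     * or through the concept described by its direct superclass in the mcAM.
     *
     * @param concepts     all concepts described in the mcAM
     * @param superclassOf mcAM concept -> concept described by its superclass (absent if none)
     * @param mapToVoSC    manually judged Map: mcAM concept -> mapped classes in the voSC
     */
    static boolean isConformable(Set<String> concepts,
                                 Map<String, String> superclassOf,
                                 Map<String, Set<String>> mapToVoSC) {
        for (String a : concepts) {
            if (hasMap(a, mapToVoSC)) {
                continue; // first alternative: concept a itself is mapped
            }
            String b = superclassOf.get(a);
            if (b != null && hasMap(b, mapToVoSC)) {
                continue; // second alternative: the superclass concept b is mapped
            }
            return false; // neither alternative holds for concept a
        }
        return true;
    }

    private static boolean hasMap(String concept, Map<String, Set<String>> mapToVoSC) {
        Set<String> targets = mapToVoSC.get(concept);
        return targets != null && !targets.isEmpty();
    }

    public static void main(String[] args) {
        Set<String> concepts = Set.of("Pet", "Dog", "Cat", "Fish", "Penguin");
        Map<String, String> superclassOf =
                Map.of("Dog", "Pet", "Cat", "Pet", "Fish", "Pet", "Penguin", "Pet");
        // The voSC of the Pet example contains only the class Pet.
        Map<String, Set<String>> mapToVoSC = Map.of("Pet", Set.of("Pet"));
        System.out.println(isConformable(concepts, superclassOf, mapToVoSC));
    }
}
```

Because the Map is modeled as a set-valued relation, the same representation also accommodates a concept that is mapped to more than one class of the voSC.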

1.5.1 Examples of Mappings between mcAM Concepts and
voSC

To illustrate how mappings are created between mcAM concepts and voSC, we give
the following examples:

Note that in the following examples, for a concept a, the name of the class b
that describes the concept a in the mcAM and the name of the class c in the
voSC v to which class b is mapped might differ. However, class b and class c are
ontologically identical, since they describe the same concept with highly similar
attributes and operations; thereby, class b in the mcAM is considered to be mapped
to class c in the voSC v.

Case 1 - A class that describes a concept in the mcAM can be mapped to one class
in the voSC.

Example of Case 1: This example is taken from the repository of one of our study
projects, i.e., ZooTypers [15] on GitHub. Comparing Figure 1.4 with Figure 1.5
shows that the name of the class SinglePlayModel in the mcAM differs from the
name of the class SinglePlayerModel in the voSC. This might be due to a
misspelling of the class name in the mcAM. However, the class SinglePlayModel in
the mcAM and the class SinglePlayerModel in the voSC are ontologically identical,
since they have highly similar attributes and operations, and these two classes
are thereby considered to describe the same concept SinglePlayerModel. Therefore,
the class SinglePlayModel in the mcAM is considered to be mapped to the class
SinglePlayerModel in the voSC.

Figure 1.4: A class SinglePlayModel describes a concept SinglePlayerModel in the
mcAM, yet the name of the class SinglePlayModel contains a misspelling.

Figure 1.5: The class SinglePlayerModel in this voSC has highly similar attributes
and operations to the class SinglePlayModel in the mcAM and thus, the class
SinglePlayModel is considered to be mapped to the class SinglePlayerModel
in this voSC.
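
The notion of "highly similar attributes and operations" in Case 1 is a human judgment in this thesis. Purely as an illustration of how such a judgment could be approximated, the following Java sketch compares two sets of member signatures with the Jaccard similarity. The class MemberSimilaritySketch, the threshold parameter, and the member names are hypothetical assumptions and do not stem from the studied projects.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper that approximates "highly similar attributes and operations". */
final class MemberSimilaritySketch {

    /** Jaccard similarity of two member-signature sets: |A ∩ B| / |A ∪ B|. */
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0; // two classes without members are treated as identical
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }

    /** Flags a model class and a code class as a mapping candidate for manual review. */
    static boolean isMappingCandidate(Set<String> modelMembers,
                                      Set<String> codeMembers,
                                      double threshold) {
        return jaccard(modelMembers, codeMembers) >= threshold;
    }

    public static void main(String[] args) {
        // Placeholder signatures; they are not taken from the real classes.
        Set<String> modelClass = Set.of("attributeA", "attributeB", "operationX()");
        Set<String> codeClass = Set.of("attributeA", "attributeB", "operationX()", "operationY()");
        System.out.println(jaccard(modelClass, codeClass));
        System.out.println(isMappingCandidate(modelClass, codeClass, 0.7));
    }
}
```

Such a score could at most shortlist candidates; whether two classes describe the same concept still has to be decided by a human reader who interprets the semantics of the members.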

Case 2 - A class that describes a concept in the mcAM can be mapped to more
than one class in the voSC.

Example of Case 2: This example is taken from the repository of one of our study
projects, i.e., RaiseMeUp [16] on GitHub. Figure 1.6 illustrates that a concept
decorators is described by a class ItemDecorator in the mcAM. Decorators are part
of the decorator design pattern [17]. Thereby, the name of the class ItemDecorator
and the relationships between ItemDecorator and the subclass Upgrade derived from
the superclass Item possibly imply that the decorator design pattern is applied in
this project. One or more decorators of the decorator design pattern might be
further planned to decorate the subclass Upgrade derived from the superclass Item
in the mcAM.

To confirm whether the decorator design pattern is applied in the voSC and whether
the concept decorators created in the mcAM remains in the voSC, we checked the
detailed implementation of the voSC illustrated in Figure 1.7. We found that the
classes Slot, LeftUpgradeSlot, and RightUpgradeSlot are indeed related to
decorators of the decorator design pattern. Also, these classes are extensions
of the interface Icon provided by the Java library. Note that an MVC architectural
pattern is adopted in the voSC. On this basis, in the voSC, the subclass
Upgrade derived from the superclass Item is a Model-related class, and the interface
Icon is invoked in a View-related class that is responsible for communicating with
that subclass Upgrade. Therefore, these three classes in the voSC are considered to
be related to decorators and to be mapped to the class ItemDecorator in the mcAM,
since the concept decorators indeed remains in the voSC and is described by those
three classes.

Figure 1.6: In the mcAM, a concept, i.e., decorators is described by a class
ItemDecorator, which might imply the decorator design pattern would be
applied in the voSC.

Figure 1.7: In this voSC, the concept decorators in the mcAM is described by
the superclass Slot and the subclasses LeftUpgradeSlot and RightUpgradeSlot
derived from the superclass Slot. These three classes are the extension of the
interface Icon embedded in the Java library. This interface Icon is invoked
in a class related to View, which is responsible for communicating with the
subclass Upgrade derived from the superclass Item in the mcAM.

Case 3 - In the mcAM, for concept a, there is a second concept b described by the
superclass of the class that describes the concept a and for this second concept b,
that superclass in the mcAM can be mapped to one class in the voSC.


Example of Case 3: This example is taken from the repository of one of our study
projects, i.e., RaiseMeUp [16] on GitHub. Figure 1.8 illustrates that in the mcAM,
for the concept, e.g., Dog, a second concept Pet is described by the superclass Pet
of the class that describes the concept Dog. For the second concept Pet, there is a
Map to this voSC, i.e., mapping that superclass Pet in the mcAM to one class Pet
in the voSC (see Figure 1.9).

Figure 1.8: In the mcAM, for the concept, e.g., Dog, a second concept Pet
is described by the superclass Pet of the class Dog that describes the
concept Dog.

Figure 1.9: This voSC only includes one class Pet that can be mapped to
the superclass Pet of the class, e.g., Dog in the mcAM.

Case 4 - A class that describes a concept a in the mcAM can be mapped to an
additional superclass (not in the mcAM, but added to the voSC) that describes the
concept a in the voSC. This holds regardless of any additional subclasses derived
from that superclass (also not in the mcAM, but added to the voSC) that describe
one or more additional concepts (not in the mcAM, but added to the voSC).

Example of Case 4: This example is taken from the repository of one of our
study projects, i.e., Neuroph [18] on GitHub. Figure 1.10 illustrates that a concept,
i.e., InputFunction is described by a class InputFunction in the mcAM. Figure 1.11
illustrates that this concept InputFunction remains in the voSC and is described by
one additional superclass InputFunction (not in the mcAM, but added to the cSC).
Therefore, the class InputFunction in the mcAM is considered to be mapped to that
superclass InputFunction in the voSC.


As observed in the comparison between Figure 1.10 and Figure 1.11, an additional
concept, e.g., Max, which is not in the mcAM, is added to the voSC and is described
by an additional subclass Max (not in the mcAM, but added to the voSC). For this
additional concept Max, there is a second concept InputFunction described by the
superclass InputFunction of the class Max that describes the additional concept Max
in the voSC, and for this second concept InputFunction, the superclass InputFunction
is considered to be mapped to the class InputFunction in the mcAM.

Figure 1.10: A concept, i.e., InputFunction is described by a class
InputFunction in the mcAM.

Figure 1.11: The concept InputFunction in the mcAM is described by a superclass
InputFunction that is not in the mcAM but added to the voSC.

1.5.2 Ideal Selection of One cSC among Multiple cSCs of a
mcAM

In a project’s GitHub repository, commits are made throughout the SDLC as the code
evolves. Thus, a project has different voSCs as the code evolves. A mcAM may
have one or more cSCs, i.e., one or more voSCs that are conformable to the mcAM.
However, given the time constraints and our wish to study more projects, we
decided to select only one of these cSCs. Compared with the other cSCs of the mcAM,
this cSC should ideally cover the most attributes and operations associated with the
concepts in the mcAM. However, this cannot be guaranteed, since it is not possible
for us to check every voSC one by one (see the detailed methodology described in
section 3.2). This leads to a threat to validity, which is discussed in
section 5.6.

To illustrate how one cSC among multiple cSCs for a mcAM is selected, the follow-
ing example is given:

This example is taken from the repository of one of our study projects, i.e.,
EAPLI_PL_2NB [19] on GitHub. Figure 1.12 illustrates that the mcAM contains eight
concepts, which are described by eight classes, i.e., BaseUI, BaseController,
IncomeRegisterUI, RegisterIncomeController, Income, IncomeTyperRepository,
CheckingAccount, and IncomeRepository, respectively.


Figure 1.12: In the mcAM, eight concepts are described by eight classes, i.e.,
BaseUI, BaseController, IncomeRegisterUI, RegisterIncomeController, Income,
IncomeTyperRepository, CheckingAccount, and IncomeRepository, respectively.


Below are two examples of cSCs of the mcAM depicted in Figure 1.12. As
observed in the comparison between Figure 1.13 and Figure 1.14, both cSCs
cover all eight concepts in the mcAM. The only difference between them is that
the cSC represented by Figure 1.14 covers four more operations (underlined in blue)
than the cSC represented by Figure 1.13. Thus, we would ideally want to select the
cSC illustrated in Figure 1.14.

Figure 1.13: A cSC of the mcAM covers all eight concepts in the mcAM.

Figure 1.14: Compared with the cSC illustrated in Figure 1.13, this cSC of
the mcAM covers four more operations.
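
The ideal selection described in this section amounts to choosing, among the candidate cSCs, the one that covers the most attributes and operations of the mcAM. The following Java sketch illustrates this criterion under stated assumptions: the class CscSelectionSketch, the Candidate type, and the string representation of members are hypothetical, and the sketch presumes that the members of every candidate are already known, which section 1.5.2 notes is not feasible in practice.

```java
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

/** Hypothetical illustration of the ideal cSC selection criterion. */
final class CscSelectionSketch {

    /** A candidate cSC: a commit identifier plus the members found in that version. */
    static final class Candidate {
        final String commitId;
        final Set<String> members;

        Candidate(String commitId, Set<String> members) {
            this.commitId = commitId;
            this.members = members;
        }
    }

    /** Number of mcAM attributes and operations that also appear in the candidate. */
    static int coverage(Set<String> modelMembers, Candidate candidate) {
        Set<String> covered = new HashSet<>(modelMembers);
        covered.retainAll(candidate.members);
        return covered.size();
    }

    /** Returns the candidate with the highest coverage, or empty if there is none. */
    static Optional<Candidate> selectBest(Set<String> modelMembers, List<Candidate> candidates) {
        return candidates.stream()
                .max(Comparator.comparingInt((Candidate c) -> coverage(modelMembers, c)));
    }

    public static void main(String[] args) {
        // Placeholder member names; they are not taken from the studied project.
        Set<String> modelMembers = Set.of("op1()", "op2()", "op3()", "attr1");
        Candidate earlier = new Candidate("commit-a", Set.of("op1()", "attr1"));
        Candidate later = new Candidate("commit-b", Set.of("op1()", "op2()", "op3()", "attr1"));
        selectBest(modelMembers, List.of(earlier, later))
                .ifPresent(best -> System.out.println(best.commitId));
    }
}
```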

1.6 Research Questions
To reach the goal of this thesis of studying the characteristics of manual abstraction,
we formulated the following research questions:

RQ1: Does the cSC of the mcAM cover all elements planned out in that mcAM?
As mentioned in section 1.5, if a mcAM has multiple cSCs, we select only one
of them. It is possible that the selected cSC does not cover all elements (i.e.,
attributes and operations) planned out in that mcAM. This causes differences
between the cSC of that mcAM and that mcAM.

RQ2: What causes the differences between the cSC of the mcAM and that mcAM?
With the answer to RQ2, we can get a list of cases that cause the differences between
the cSC of the mcAM and that mcAM. Could these cases be categorized into some
common causes?

RQ3: What are the differences between the cSC of the mcAM and that mcAM?
With the answer to RQ3, we can conclude some common causes that cause the
differences between the cSC of the mcAM and that mcAM.

1.7 Contributions
The results of this thesis provide a sorted list of cases that cause the cSC to
deviate from the mcAM and a sorted list of suspected cases inferred from these
observed cases. These suspected cases possibly exist and would also lead
to differences between the mcAM and the cSC. Accordingly, this thesis has
the following implications for the reverse engineering and modeling community:

1. To provide input for future improvement of reverse engineering
methods/tools in terms of abstractness: As far as we know, reverse engineering is
imperfect, as it does not manage to imitate the human ability to abstract relevant
information from the source code. Guéhéneuc [5] noted that no existing
mainstream reverse engineering tool produces abstract yet precise class diagrams.
The concluded cases of creating various manual abstractions on the model elements
can be used as such input.

2. To provide input for developing mapping rules which can be used for
the consistency check(s) between the model design and code implementation:
Existing methods/technologies developed for the consistency checks between
code and design are purely structural and do not take the semantics conveyed by the
model elements into account. Yet, the semantics is closely related to the manual
abstraction characteristics. Thereby, the sorted list of cases of the differences
between the code and design, which resulted from manually studying five Java
projects based on interpreting the semantics of the model elements, can provide
such input.

3. To provide a guideline for designing model elements to avoid
over-abstraction and over-specification: Given the abstract nature of the model,
the model elements can be modeled at various levels depending on the design
decisions made by architects. When it comes to design decisions for creating
different elements of the class diagram during the design stage, little is known
about which design decisions tend to be accepted or rejected by developers in the
code implementation. Rejection can result from over-specifying the model elements
and thereby losing abstractness. In that case, the developers disagree with the
design decisions made by architects and make different decisions in the code
implementation. On the other hand, rejection can also result from over-abstracting
the model elements. This leads to vague design decisions that developers cannot
adopt as they are, given that the developers need to settle these vague design
decisions and further specify their detailed implementation. Deviations of the
code from the design then arise. Thus, the concluded cases of the
differences caused by developers’ deviations from the architects’ design decisions
allow us to create this guideline.

1.8 Structure of the Paper
This thesis presents a systematic manual study of the characteristics of the
differences between the mcAM and one cSC of that mcAM by analyzing five open-source
Java projects on GitHub. The structure of this thesis is outlined as follows:
Chapter 2 describes the relevant theoretical knowledge and earlier research done by
others. Chapter 3 details the methodology of the five case studies. Chapter 4
presents the results of this thesis work. Chapter 5 describes and discusses the
threats to the validity of this thesis. Chapter 6 concludes this thesis work and
suggests future work.



2
Theory and Related Work

In this chapter, to help the reader understand the work of this thesis,
the relevant theoretical knowledge of Java and UML is first described. This lays the
ground for understanding the constituents of a mcAM, so that every constituent
in that mcAM can be mapped to the corresponding constructs of one cSC of that
mcAM. After that, related earlier work done by other authors on reverse engineering
and on models in open-source systems is presented and discussed. The former work
inspired this thesis and is the work from which it originated. The
study subjects of this thesis rely on the outcomes of the latter work.

2.1 Theory
In order to detect the differences between the mcAM and the cSC of that mcAM,
for each model constituent in the mcAM, there must be a map to the corresponding
construct(s) of the cSC of that mcAM. Only with knowledge of Java and UML can
one understand how to map every element in a mcAM to the corresponding con-
struct(s) of a cSC of that mcAM.

As Java and UML evolve, multiple versions of their specifications exist. Moreover,
the mcAM and the first voSC found in the repository of each project were created at
different times. To ensure that the mcAMs and voSCs included in the five projects
studied are matched against appropriate versions of the Java and UML specifications,
respectively, it is critical to identify when the earliest mcAM and voSC were created
in the repository. The release dates of the Java and UML specifications used should
ideally be as close as possible to the creation dates of the earliest mcAM and voSC,
since later versions contain updates that may not apply to these mcAMs and voSCs.
After checking, the earliest mcAM and voSC were created on July 8, 2011, and
August 24, 2011, respectively. Consequently, the Java SE7 specifications (released in
July 2011) [20] and the UML v2.4.1 superstructure specifications (released in
July 2011) [21] were selected as the basis for this study.

2.1.1 Related Java Knowledge
Relevant Java knowledge needs to be understood, including Java syntax/specifications
and the OO paradigm. The parts of the Java specifications that are of particular
interest are described, along with several related concepts in the OO paradigm, which
help to understand how abstraction can be created over the cSC.

2.1.1.1 Java SE7 Specifications

The parts of the Java SE7 specifications that are of particular interest are described
in the following:

Framework is a set of classes and interfaces that provides a ready-made architecture
[22].

Collection framework provides the ready-made classes and interfaces needed to
represent a group of objects (also called instances) as a single entity in Java [22].
For example, the Map interface with its corresponding classes, e.g., HashMap, is
used to represent a group of instances.
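
As a minimal sketch of the Map/HashMap example above, the following code groups a few
instances under a single Map entity. The class name CollectionSketch, the method
buildStock, and the keys and values are assumptions made for illustration only; they
are not taken from the projects studied.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class; none of these names come from the studied projects.
class CollectionSketch {

    // Groups several Integer instances under one Map entity, keyed by a name.
    static Map<String, Integer> buildStock() {
        // Declared by the Map interface, implemented by the HashMap class.
        Map<String, Integer> stock = new HashMap<>();
        stock.put("apple", 3);
        stock.put("pear", 5);
        return stock;
    }

    public static void main(String[] args) {
        Map<String, Integer> stock = buildStock();
        System.out.println(stock.size() + " entries, pears: " + stock.get("pear"));
    }
}
```

Declaring the variable with the interface type (Map) rather than the class type
(HashMap) keeps the choice of the concrete collection class open.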

2.1.1.2 Object-Oriented (OO) Paradigm

Several related concepts of the OO paradigm are described in the following, according
to the sources [20, 23, 24, 25, 26, 27].

Object is often referred to as a class instance or an array in Java, and all
objects created belong to a certain class [20, 25]. Objects are an encapsulation of
information and behavior relative to some entity of the application domain under
consideration [25]. In real systems, many objects with similar information (data)
and behavior (functionality) can be found [25].

Class captures those objects with similar information (data) and behavior (func-
tionality) and classes can be viewed as an abstract data type [25]. Class is defined as
including at least two types of features: attributes (also called variables, fields or
data members), which stand for the stored information and methods (also called
operations or function members), which represent the behavior [25].

Encapsulation is a technique for minimizing interdependencies among separately-
written modules by defining strict external interfaces [26]. An encapsulated module
can only be accessed by clients (that is, other modules that make use of this module)
via this interface [27]. Implementation details are “hidden” within the module. The
primary reason for requiring encapsulation is to make it possible to change (improve)
the implementation of a module without having to change (and/or recompile) the
module’s clients [27].

Take Java as an example. To encapsulate the attributes included in a class, all
attributes of that class should be set to private unless they are specifically
declared public [25]. The public setter and getter operations defined for the
attributes of a class are called its interface and should only be the “tip of the
iceberg”, with the hidden part being called the implementation [25]. This interface
allows the supplier class to provide the values of those attributes to the customer
class [25].
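
The encapsulation described above can be sketched as follows. The classes Account
(supplier) and AccountClient (customer), as well as the attribute balance, are
hypothetical names chosen for illustration; the sketch only assumes the private
attribute plus public getter/setter pattern described in the text.

```java
// Hypothetical supplier class with an encapsulated attribute.
class Account {
    // Hidden implementation detail: not accessible from other classes.
    private int balance;

    // Public getter: part of the interface of the class.
    public int getBalance() {
        return balance;
    }

    // Public setter: part of the interface of the class.
    public void setBalance(int balance) {
        this.balance = balance;
    }
}

// Hypothetical customer class: it reaches the attribute only via the interface.
class AccountClient {
    static int depositTen(Account account) {
        account.setBalance(account.getBalance() + 10);
        return account.getBalance();
    }
}
```

Because AccountClient never touches the field directly, the representation of balance
could be changed inside Account without changing AccountClient.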

Inheritance allows a subclass that extends a superclass to be arranged in a
hierarchical structure [24, p. 451]. The subclass thereby takes on the general
attributes and operations of the superclass in the inheritance chain, so that these
attributes and operations form part of the definition of the subclass for code reuse
[23, p. 63].
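
A minimal sketch of inheritance in Java is given below. The classes Shape and Circle
and their members are assumptions made for illustration only.

```java
// Hypothetical superclass holding a general attribute and operation.
class Shape {
    protected String label = "shape";

    public String describe() {
        return "I am a " + label;
    }
}

// Hypothetical subclass: it reuses label and describe() from Shape.
class Circle extends Shape {
    private final double radius;

    Circle(double radius) {
        this.radius = radius;
        this.label = "circle"; // inherited attribute, now part of Circle
    }

    // Operation that exists only in the subclass.
    public double area() {
        return Math.PI * radius * radius;
    }
}
```

Here Circle does not declare describe() itself, yet the operation forms part of its
definition through the extends clause.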

2.1.2 Unified Modeling Language (UML)
UML is a modularly structured language that can provide specific components of
primary interests for a specific domain or application [21, p. 1]. UML is a de facto
standard formalism for software design and analysis [6]. With some existing case
tools such as Enterprise Architect [28] and IntelliJ IDEA [29], specific constituents
of UML can be handily visualized to accommodate the specific requirements. Of
particular interest are class diagrams used for modeling the information on the do-
main of interest in terms of objects (instances) organized in classes and relationships
between them [6]. Thus, the specific constituents of UML that are most likely to
be required in most cases for constructing a class diagram are classifiers (classes),
classifiers’ (classes’) embedded text notation for attributes and operations, and rela-
tionships between classifiers (classes). As mentioned above, they can all be visualized
with a case tool; these constituents are detailed separately in this section.

2.1.2.1 Graphical Notation for Classifiers

Classifier refers to a classification of instances describing a set of instances that
have features in common [21, p. 51], in which the textual notation for attributes
and operations is embedded.

Figure 2.1 presents examples of graphical notation for a class Window at three
different levels of abstraction: suppressed (on the top left corner), analysis (on the
right), and implementation (on the bottom left corner) [21, p. 50]. However, the
level, or the mix of levels, at which a class is constructed in a particular system
depends on the design decisions made by the architects during the design stage. In
some cases, they may decide to model a specific part of the system that is of primary
interest at a low level (e.g., an implementation level) to give developers more
insight into that part of the system during the implementation stage. In contrast,
they may decide to model a part of the system that is of less interest at a higher
level relative to the level of implementation (e.g., a suppressed or an analysis
level).

As seen in Figure 2.1, if the class Window is constructed at an analysis or
implementation level, details of its embedded textual notation for attributes and
operations are (more or less) laid out. However, the meta-textual notation defined
for them in [21] is far more comprehensive than any of the three illustrated in
Figure 2.1. Figure 2.1 is only meant to give the reader an impression of what a class
may look like at various abstraction levels. In general, the layout of a graphical
class is composed of three primary sections. These three sections are elaborated in
the following with the aid of the example shown on the right of Figure 2.1.

• Upper section (mandatory): Contains the name of the class, e.g., Window
[30].

• Middle section (optional): Contains one or more attributes of the class
Window, which are used to describe the qualities of Window [30], e.g., the
attribute size: Area is used to describe the size of an instance of the class
Window. Notably, this section is only required when describing a specific
instance of a class [30].

• Bottom section (optional): Includes the class operations displayed in list
format; each operation, e.g., hide(), takes up its own line [30]. The operations
describe how a class interacts with data [30], e.g., for the operation
attachX(xWIN: XWindow), the class Window references the class XWindow as a
parameter data type, and thereby an interaction between the class XWindow
and the class Window arises.
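
To relate the three sections to code, the following sketch shows how the class Window
could look in Java. Only the names Window, size, Area, hide, attachX, and XWindow come
from the example above. The stub types, the initial value of size, the fields xWin and
visible, and the helper operations isVisible and isAttached are assumptions introduced
so that the sketch compiles and can be observed.

```java
// Stub types, introduced here only so that the sketch compiles.
class Area {
    final int width;
    final int height;

    Area(int width, int height) {
        this.width = width;
        this.height = height;
    }
}

class XWindow { }

// Upper section: the class name.
class Window {
    // Middle section: attributes, e.g., size: Area (initial value is assumed).
    public Area size = new Area(100, 100);
    private XWindow xWin;            // assumed field holding the attached XWindow
    private boolean visible = true;  // assumed state changed by hide()

    // Bottom section: operations, one per line in the diagram.
    public void hide() {
        visible = false;
    }

    // attachX(xWIN: XWindow): Window references XWindow as a parameter type.
    public void attachX(XWindow xWIN) {
        this.xWin = xWIN;
    }

    // Assumed helpers, used only to observe the state in this sketch.
    public boolean isVisible() {
        return visible;
    }

    public boolean isAttached() {
        return xWin != null;
    }
}
```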

Figure 2.1: Three abstraction levels of graphical class notation: suppressed (on
the top left corner), analysis (on the right), and implementation (on the bottom
left corner) [21, p. 50]. Compared with the suppressed level with only the specified
name of a class, the other two levels present a more comprehensive layout of a class
with (more or less) textual notation for attributes and operations embedded.

2.1.2.2 Conventions of BNF

To standardize the textual notation for attributes and operations embedded in the
classifier, legal formats are first specified, i.e., the Backus-Naur Form (BNF)
conventions (as depicted in Figure 2.2). The legal formats make the textual notation
for attributes and operations easier to interpret.

Note that the specification of BNF conventions applies to both the earlier UML
v1.0 specifications series and the latest v2.0 specifications series.

Figure 2.2: Specification of Backus-Naur Form (BNF) conventions [21, pp. 16–17].

2.1.2.3 Textual Notation for Attributes

With reference to [21, pp. 129–130], the notation defined for attributes is depicted
in Figure 2.3.

Note that attribute is a legacy term of the earlier UML v1.0 specifications;
attributes are referred to as properties in the UML v2.4.1 superstructure
specifications. Property (denoted in Figure 2.3) and attribute are ontologically
identical. The term attribute is used in this thesis.

Figure 2.3: Textual notation for attributes [21, pp. 129–130].

This thesis work focuses on five constituents of the attributes’ textual notation:
name, prop-type, multiplicity, default and visibility [21, p. 129] (as depicted in Fig-
ure 2.3, underlined in pink). These five constituents are referred to as name, attribute
type, multiplicity, default value, and visibility, respectively, in this thesis. The def-
initions of these constituents used in this thesis and the reasons why they are of
interest are detailed in the following.

Note that although their corresponding definitions are depicted in Figure 2.3, some
of them are complemented with the definitions in the earlier UML v1.5 specifications
[31]. In this way, the two versions of the definitions complement each other, which
makes these constituents easier to comprehend.

• According to the definition of name in the UML 1.5 specifications, Chapter 3,
Part 5, Section 3.25 “Attribute”, name is an identifier string, usually a simple
word, to represent an attribute [31, p. 42].

Names are mandatory for attributes’ textual notation. Names are not mere
identifiers for attributes; in particular, they carry relevant semantics related
to the static data structure of the classifier [32]. Semantics preservation is a
main objective of the refinement of design into code [32]. Thereby, attribute
names are expected to help us comprehend the mappings between the attributes
in the mcAM and cSC. That is, based on the semantics related to the attributes
in the mcAM, their corresponding attributes in the cSC that represent similar
semantics can be identified.

Note that attributes in the mcAM can correspond to either variables or constant
variables in the cSC. Accordingly, an attribute in the mcAM can be modeled as
a variable or a constant variable that will be implemented in the cSC.
According to the Java naming conventions, the names of variables should be in
camel case [33], and the names of declared class constant variables should be
all in uppercase letters with words separated by underscores (“_”) [34]. These
naming conventions are applied in the same way in the mcAM. Thus, by
observing the naming conventions of attributes in the mcAM, one can infer
whether an attribute in the mcAM can be mapped to a variable or a constant
variable in the cSC.

• According to the UML v1.5 specifications, Chapter 3, Part 5, Section 3.25
“Attribute” [31, p. 42], attribute type refers to either the name of a classifier
or a language-dependent string that maps into a primitive data type in Java.

There are two reasons for us to study the differences in attribute types. The
first reason is that attribute types might be related to relationships between
classifiers (also instances). For example, the relationships of association,
aggregation, and composition between classifiers A and B first require that
classifier A references classifier B as the type of an attribute included in
classifier A. Then, to determine the exact relationship between classifiers A and
B based on the definitions of relationships in this section, we need to check the
detailed implementation of the cSC.

Although the attribute type (name of the classifier) is critical for understanding
the relationships between classifiers, the attribute type is optional (as
illustrated in Figure 2.3). Hence, the possibility that architects omit attribute
types when designing the mcAM cannot be excluded. This is the second reason.
The omission of attribute types is a kind of abstraction. In contrast, two other
cases can be caused by developers disagreeing, in the implementation of the cSC,
with the abstraction created by architects in the mcAM: (1) the attribute type in
the mcAM is removed in the cSC; (2) the attribute type is specified differently in
the mcAM and the cSC.

• Multiplicity is specified in the textual notation for both attributes and op-
erations. Thereby, its definition and the reason why it is a focus in this thesis
are detailed separately in subsubsection 2.1.2.6.

• Default value is an expression that evaluates to the default value or values
of the attribute [21, p. 113]. According to the notation for attributes depicted
in Figure 2.3, the default value is optional. Thus, when designing attributes
(variables) in the mcAM, some architects may intentionally omit the default
values and leave it to developers to initialize the variables in the cSC. A
possible reason is that these architects anticipate new requirements in the
future, so that the values of variables will be updated by developers one or more
times during the implementation of the cSC. In this case, a default value that is
not in the mcAM is added to the cSC.

In contrast to the omission of the default values of the attributes (variables),
some architects may over-specify the default values of attributes (variables) in
the mcAM. However, developers may disagree with this and choose to remove the
specified values in favor of other approaches, such as using public setter
operations of these attributes (variables) to initialize them and update their
values one or more times in the cSC. Note that the prerequisite for using public
setter operations is that the attributes (variables) declared in the classifier
are set to private.

On the other hand, for the design of attributes (constant variables) in the
mcAM, considering that a constant variable should be assigned a value only once
across the life-cycle of the program, one question is whether the architects
intended to specify a default value for a constant variable in the mcAM or not.

As mentioned previously, differences in attributes can be caused by the
conversion between variables and constant variables between the mcAM and cSC.
Thus, another question is whether this case has an impact on the specification
of default values in the mcAM and cSC. Variables are likely to be updated one or
more times in the cSC. In particular, assume that variables in the mcAM, which
may not hold default values, are converted to constant variables with default
values assigned in the cSC. This leads to another case in which a default value
that is not in the mcAM is added to the cSC. However, unlike the former case,
this case is considered to be caused by deviations of the implementation of the
cSC from the design of the mcAM.

• Visibility is specified in the textual notation for both attributes and opera-
tions. Thereby, its definition and the reason why it is a focus in this thesis are
detailed separately in subsubsection 2.1.2.5.

Apart from the five constituents of the attributes’ textual notation described above,
the other two constituents, ‘/’ and prop-modifier (attr-modifier), are not in focus,
since no relevant cases occur in the five projects studied.
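
The mapping of the attribute constituents discussed above onto Java can be sketched as
follows. The classes Car and Passenger, the attribute names, and the UML strings in the
comments are hypothetical and serve only as an illustration. The sketch assumes the
naming conventions cited above (camel case for variables, uppercase with underscores
for constant variables).

```java
import java.util.ArrayList;
import java.util.List;

// Stub type, introduced only so that the sketch compiles.
class Passenger { }

class Car {
    // Assumed model notation: + MAX_SEATS : int = 5
    // Uppercase name with underscore: mapped to a constant variable,
    // which is assigned its value exactly once.
    public static final int MAX_SEATS = 5;

    // Assumed model notation: - currentSpeed : int = 0
    // Camel-case name: mapped to a variable; default value taken from the model.
    private int currentSpeed = 0;

    // Assumed model notation: - passengers : Passenger [0..*]
    // No default value in the model; the initialization is added in the code.
    private final List<Passenger> passengers = new ArrayList<>();

    // Public getter and setter for the private variable.
    public int getCurrentSpeed() {
        return currentSpeed;
    }

    public void setCurrentSpeed(int currentSpeed) {
        this.currentSpeed = currentSpeed;
    }

    // Adds a passenger unless the constant limit has been reached.
    public boolean board(Passenger passenger) {
        if (passengers.size() >= MAX_SEATS) {
            return false;
        }
        return passengers.add(passenger);
    }
}
```

The initialization of passengers is an instance of a default value that is not in the
model being added in the code.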

2.1.2.4 Textual Notation for Operations

With reference to [21, pp. 107–108], the textual notation defined for operations is
depicted in Figure 2.4.

Figure 2.4: Textual notation for operations [21, pp. 107–108].

This thesis work focuses on four constituents of the operations’ textual notation:
name, parameter-list, return-type, and visibility [21, p. 107] (as shown in Figure 2.4,
underlined in pink). They are referred to as name, parameter list, return type, and
visibility, respectively, in this thesis. The definitions of these constituents used in
this thesis and the reasons why they are of interest are detailed in the following.

Note that although their corresponding definitions are depicted in Figure 2.4, some
of them are complemented with the definitions in the earlier UML v1.5 specifications
[31]. In this way, the two versions of the definitions complement each other, which
makes these constituents easier to comprehend.

• According to the definition of name in the UML 1.5 specifications, Chapter 3,
Part 5, Section 3.26 “Operation”, name is defined as an identifier string to
represent an operation [31, p. 44].

Names are mandatory for the operations’ textual notation. Names are not
mere identifiers for operations; in particular, they carry relevant semantics
related to the behavioral status of the classifier [32]. Semantics preservation
is a main objective of the refinement of design into code [32]. Thereby,
operation names are expected to help us comprehend the mappings between the
operations in the mcAM and cSC. That is, based on the semantics related to the
operations in the mcAM, we are able to identify their corresponding operations
in the cSC that represent similar semantics. Sometimes, the semantics of the
operations in the mcAM might remain in the cSC, while their implementation in
the cSC deviates from their design in the mcAM. This happens when developers
disagree with the design decisions on operations made by architects in the mcAM
and make different decisions on operations in the cSC. The characteristics of
such deviations in operations between the mcAM and cSC are what this thesis
aims to study.

• Parameter list is defined as a list of parameters of the operation [21, p. 108]
(as depicted in Figure 2.4).

In Java syntax, a parameter of an operation cannot be specified with a default
value. Thereby, the default value is excluded in this thesis work. Moreover, none
of the five projects studied has relevant cases with respect to the constituents
parm-property and direction. Thus, these two constituents are also excluded.
Consequently, only three constituents of the parameter (i.e., parameter-name,
type-expression, and multiplicity) are the focus of this thesis work. They are
referred to as parameter name, parameter type, and multiplicity, respectively, in
this thesis.

– According to the definition of parameter name in the UML 1.5 specifications,
Chapter 3, Part 5, Section 3.26 “Operation”, parameter name is
defined as an identifier string to represent a parameter [31, p. 45].

Parameter names are mandatory for the specification of operations. However,
compared with the names specified for attributes and operations, parameter
names are considered here mainly as identifiers for operation parameters.
The reason is that the semantics of the names specified for operations
already help us to map the operations in the mcAM to those in the cSC.

– Parameter type is defined as an expression that specifies the type of
the parameter [21, p. 108] (as depicted in Figure 2.4). The parameter
type can be either a primitive data type or a nonprimitive data type that
is represented by the name of a classifier.

For the specification of parameters, parameter types deserve more attention
than parameter names, since the parameter type(s) of an operation is (are)
related to the relationships between classifiers. For example, classifier A
references classifier B as a parameter type. This implies a dependency
between classifiers A and B (i.e., classifier A depends on classifier B for
its implementation). However, if the parameter type specified in the mcAM is
changed or removed in the cSC, this will further lead to changes in the
relationships between these classifiers (or involving other classifiers).
Such differences in relationships between classifiers in the mcAM and cSC
are what this thesis aims to study.

– Multiplicity (referring to the subsubsection 2.1.2.6)

• According to the UML v1.5 specifications, Chapter 3, Part 5, Section 3.26
“Operation”, return type is defined as a language-dependent specification of
the implementation type (i.e., the primitive data type in Java), or types of the
value returned by the operation (i.e., the nonprimitive data type in Java) [31,
p. 45].

Note that the colon and the return type are omitted if the operation does not
return a value (as for Java void) [31, p. 45]. Thereby, if no return value is
specified for an operation in the mcAM, yet void is shown as a return type in
the cSC, we do not regard this change as a difference in return types between
the mcAM and cSC.

The reason for us to study the differences in return types is that they might
lead to changes in relationships between classifiers in the mcAM and cSC. For
example, classifier A references classifier B as a return type. This implies a
dependency between classifiers A and B (i.e., classifier A depends on classifier
B). If the return type referencing classifier B is changed, or removed and
replaced by void, this will further lead to changes in the relationship between
these classifiers (or involving other classifiers). Such differences in
relationships between classifiers in the mcAM and cSC are what this thesis aims
to study.

• Visibility (referring to the subsubsection 2.1.2.5)

Apart from the four constituents of the operations’ textual notation described above,
the remaining constituent oper-property is not a focus of this thesis, since no
relevant case occurs in the five projects studied.
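
The role of parameter types and return types described above can be sketched as
follows. The classes Order and OrderService, their operations, and the UML strings in
the comments are hypothetical and only illustrate the mapping.

```java
// Hypothetical classifier B.
class Order {
    private final int id;
    private boolean cancelled;

    Order(int id) {
        this.id = id;
    }

    int getId() {
        return id;
    }

    boolean isCancelled() {
        return cancelled;
    }

    void markCancelled() {
        cancelled = true;
    }
}

// Hypothetical classifier A: it depends on Order through its operations.
class OrderService {
    // Assumed model notation: + findOrder(id : int) : Order
    // The return type Order implies a dependency from OrderService to Order.
    public Order findOrder(int id) {
        return new Order(id);
    }

    // Assumed model notation: + cancel(order : Order)
    // No return type in the model, void in Java: not counted as a difference.
    // The parameter type Order again implies a dependency on Order.
    public void cancel(Order order) {
        order.markCancelled();
    }
}
```

If findOrder returned int instead of Order in the code, and cancel took an int, the
dependency between the two classes would disappear, which is the kind of difference
in relationships discussed above.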

2.1.2.5 Visibility

With reference to the UML 1.5 specifications, Chapter 2, Part 2, Section 2.5 “Core”
[31, pp. 35–36], the definitions of feature and visibility are given in the following:

Feature is defined as an attribute or operation, which is encapsulated within a
classifier [31, p. 35].

Visibility is defined as specifying whether Feature can be seen and referenced by
other classifiers [31, p. 36].

The four types of visibility and their denotations are listed below:

• Public (denoted by the symbol ‘+’) - Any outside classifier with visibility to
classifier A can use the Feature of classifier A [31, p. 36].

• Protected (denoted by the symbol ‘#’) - Any descendant of classifier A
can use the Feature of classifier A [31, p. 36].

• Private (denoted by the symbol ‘−’) - Only classifier A itself, or a classifier
B nested within classifier A, can use the Feature of classifier A [31, p. 36].

• Package (denoted by the symbol ‘∼’) - Any classifier declared in the same package
(or a nested subpackage, to any level) as the owner of the Feature can use the
Feature [31, p. 36].
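
As a minimal sketch, the four visibility symbols can be mapped onto Java access
modifiers as shown below. The class VisibilitySketch and the method toJavaModifier are
hypothetical names. Note that the correspondence is only approximate: in Java,
protected also grants access within the same package, and package-level access (no
keyword) does not extend to subpackages.

```java
// Hypothetical helper that maps a UML visibility symbol to a Java modifier.
class VisibilitySketch {

    static String toJavaModifier(char umlSymbol) {
        switch (umlSymbol) {
            case '+':
                return "public";
            case '#':
                return "protected";
            case '-':
            case '\u2212': // typographic minus sign, as printed in diagrams
                return "private";
            case '~':
            case '\u223C': // tilde operator, as printed in diagrams
                return ""; // package access has no keyword in Java
            default:
                throw new IllegalArgumentException(
                        "Unknown visibility symbol: " + umlSymbol);
        }
    }

    public static void main(String[] args) {
        // Builds a field declaration from an assumed model attribute "+ size : int".
        System.out.println(toJavaModifier('+') + " int size;");
    }
}
```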

Osman et al. [35] reported that software engineers prefer to leave out Private
Operations and Protected Operations to simplify a class diagram. However, this
has not been validated in a case study, and the extent to which they are left out is
unknown. This thesis aims to fill this gap. Considering the encapsulation of
attributes in the OO paradigm, one question is whether, to simplify a class diagram,
the public visibility of the setter and getter operations of private attributes is
preferably excluded, or whether even both the public visibility of those operations
and the private visibility of the encapsulated attributes are excluded.

In another source [36], Osman et al. found that “counting the number of
public operations” is the most important metric for indicating the importance of
a class. That work analyzes, from the reverse direction, how to recover an important
class for a reverse-engineered class diagram. From the forward direction, for both
important and less important classes, it remains a question whether the exclusion of
the public visibility of operations relates to the types of operations (based on
the three types of operations defined in this thesis: constructors, setter or getter
operations, and other operations, which are detailed in section 4.3). These questions
motivate this thesis to study the visibility of both attributes and operations.

2.1.2.6 Multiplicity

Multiplicity is defined as an inclusive interval of non-negative integers beginning
with a lower bound and ending with a (possibly infinite) upper bound [21, p. 95].

The textual notation for multiplicity specified by BNF is depicted in Figure 2.5.
Only the multiplicity range is the focus since the other constituents of multiplic-
ity are absent in the five mcAMs studied. Considering the Java specifications, the
multiplicity in the mcAM can be either an array or a collection, and then the mul-
tiplicity range should be [0,+∞] [7].

Multiplicity is specified in the attributes and operations for three constituents that
are related to data types. These three constituents are attribute type, parameter
type, and return type. There are two reasons for focusing on the multiplicity of these
three constituents. The first reason is that there might be differences in the inter-
face types provided in the collection, e.g., Set<name of the classifier> in the mcAM
changed into List<name of the classifier> in the cSC. The second reason is that the
multiplicity not in the mcAM might be added to the cSC, e.g., name of the classifier
in the mcAM is changed into List<name of the classifier> in the cSC. Multiplicity
is a property of these three constituents and represents the number of instances of
the classifier, so multiplicity should be considered together with these
constituents. On the other hand, the multiplicity and the relationships of
aggregation and composition are correlated (see their corresponding definitions and
semantics in subsubsection 2.1.2.7).

Note that the multiplicity placed at the end of an association is not the focus of this
thesis, as it is not easy to manually check how many invocation sites of the instances
exist in the cSC.
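
The first kind of difference mentioned above, a Set in the model changed into a List
in the code, can be sketched as follows. The classes Item, ModeledBasket, and
ImplementedBasket are hypothetical; the sketch only shows why such a change in the
collection interface type is more than a naming difference.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Stub classifier, introduced only so that the sketch compiles.
class Item { }

// Assumed model side: items : Set<Item>
class ModeledBasket {
    private final Set<Item> items = new LinkedHashSet<>();

    void add(Item item) {
        items.add(item); // a Set silently ignores an element it already contains
    }

    int size() {
        return items.size();
    }
}

// Assumed code side: the same attribute implemented as List<Item>
class ImplementedBasket {
    private final List<Item> items = new ArrayList<>();

    void add(Item item) {
        items.add(item); // a List keeps every added element, duplicates included
    }

    int size() {
        return items.size();
    }
}
```

Adding the same Item instance twice yields one element in ModeledBasket but two in
ImplementedBasket, so the number of instances held can differ between model and code.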

Figure 2.5: Syntax for a multiplicity string [21, p. 98].

2.1.2.7 Graphical Notation for Relationships

The aim of this thesis is to study seven types of relationships of the mcAM:
dependencies, usages, associations, aggregations, compositions, inheritances (also
known as generalizations), and realizations. With reference to the UML v1.5
specifications, Chapter 3, Part 5 [31, pp. 34–93], and the UML v2.4.1 specifications
[21], their definitions are given as follows (note that, since clear definitions are
unavailable for some relationships, their definitions are complemented with their
semantics to make them easier to comprehend):

Dependency is defined as a relationship that relates the model elements (con-
stituents) themselves and does not require a set of instances for its meaning [31,
p. 90]. Dependency signifies that a class requires another class for its specification
or implementation [21, p. 61], so dependency indicates a situation in which a change
to the target element may require a change to the source element in the dependency
[31, p. 90].

Usage is defined as a relationship where one class requires another class for its full
implementation or operation [21, p. 139].

Note that in the metamodel, Usage is a Dependency in which the client requires the
presence of the supplier.

A binary association refers to an association among exactly two classes (includ-
ing the possibility of an association from a class to itself) [31, p. 68].

Aggregation refers to a whole/part type of binary association relationship.

Composition refers to a strong form of aggregation that requires a part instance
to be included in at most one composite at a time [21, p. 38]. If a composite is
deleted, all of its parts are normally deleted with it [21, p. 38].

Inheritance refers to a taxonomic relationship between a more general superclass
and a more specific subclass [21, p. 38]. Each instance of the subclass is also an in-
direct instance of the superclass [21, p. 70]. Thus, the subclass inherits the features
of the superclass [21, p. 38].

Realization refers to a relationship between a class and an interface implying that
the class supports the set of features owned by the interface and any of its parent
interfaces [21, p. 89].

The graphical notation for the seven types of relationships defined above is depicted
in Figure 2.6.
Of particular note, according to the definitions of the relationships of dependency,
association, aggregation, and composition, four corresponding levels of abstraction
(from 1 to 4) are defined, as depicted in Figure 2.7. The higher the level of
abstraction, the lower the level number.
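
As a rough sketch, the relationships defined above could appear in Java code as shown
below. All class names are hypothetical. Java has no dedicated syntax for aggregation
or composition, so the distinction made in the comments rests on an assumed
convention (a part created inside the whole versus a part passed in from outside) and
has to be confirmed by inspecting the actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

interface Movable {          // interface to be realized
    String move();
}

class Vehicle { }            // more general superclass
class Engine { }
class Driver { }
class Trailer { }

// Inheritance (extends Vehicle) and realization (implements Movable).
class Truck extends Vehicle implements Movable {
    // Composition (assumed): the part is created and owned by the whole.
    private final Engine engine = new Engine();

    // Aggregation (assumed): parts exist independently and are passed in.
    private final List<Trailer> trailers = new ArrayList<>();

    // Association: an attribute typed by another class.
    private Driver driver;

    void assign(Driver driver) {
        this.driver = driver;
    }

    void attach(Trailer trailer) {
        trailers.add(trailer);
    }

    boolean hasDriver() {
        return driver != null;
    }

    int trailerCount() {
        return trailers.size();
    }

    @Override
    public String move() {
        return "moving";
    }
}

class Mechanic {
    // Dependency/usage: Truck appears only as a parameter type, not as an attribute.
    int inspect(Truck truck) {
        return truck.trailerCount();
    }
}
```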

Figure 2.6: Graphical notation for seven types of relationships.

Figure 2.7: Four abstraction levels specified for four relationships - dependency,
association, aggregation, and composition, respectively.

2.2 Related Work
This section consists of two parts. The first part is related to the currently existing
reverse engineering methods/tools. The second part describes the usage of models
in the software development practice.

2.2.1 Reverse Engineering
Müller et al. [37] refined the definition of reverse engineering in [38] as “a process
of analyzing a subject system to identify its current components and their
interrelationships and to extract and create system abstractions and design
information.” Reverse engineering methods/tools play a key role for legacy systems
that lack a design. In particular, class diagrams are often poorly updated during
development and maintenance [39, 13]. There is a concern that such a divergent class
diagram cannot help software maintainers understand the system’s architecture in the
same way during later maintenance. As a solution, reverse engineering methods/tools
can be used to automatically generate reverse-engineered class diagrams that are
extracted from the current code. Such a reverse-engineered class diagram can
represent the up-to-date architecture of the system.

2.2.1.1 Manual Abstraction Created over Code

The class diagrams produced during the design and implementation phases of the
SDLC can be referenced by software maintainers during the maintenance phase to
understand the system’s architecture. However, in some cases, class diagrams
contain large volumes of information, which makes it hard for software
maintainers to understand the system’s architecture [35]. It is therefore
essential to understand how software engineers manually create abstractions, so
that class diagrams can be condensed or simplified accordingly. For this purpose,
Osman et al. [35] conducted a survey to investigate how manual abstraction is
created over code. The survey involved 32 software developers, 75% of whom had
more than 5 years of experience with class diagrams [35]. They found that the
important elements in a class diagram are class relationships, meaningful class
names, and class properties [35]. The information that should be excluded from a
simplified class diagram comprises GUI-related information, private and protected
operations, helper classes, and library classes [35]. However, these findings
still need to be validated in case studies. The five case studies in this thesis
can help validate them to some extent.

2.2.1.2 Solutions and Attempts to Provide Abstraction for Reverse-Engineered
Class Diagrams

Making reverse-engineered diagrams both abstract and precise is a primary goal of
the reverse engineering community. Accordingly, concepts related to diagram
condensation/simplification and diagram abstractness were proposed first, and
several techniques aiming at achieving this goal were developed later.

Guéhéneuc [5] observed that reverse engineering tools that are both abstract and
precise do not yet exist on the market. Guéhéneuc [7] first developed a tool
named PTIDEJ to produce precise reverse-engineered class diagrams, in particular
to infer use, association, aggregation, and composition relationships, motivated
by the lack of clear definitions of those relationships. According to [7], the
class diagrams produced by PTIDEJ are even more accurate than those manually
created by humans. Subsequently, Guéhéneuc [5] argued that the lack of
abstraction in existing reverse engineering tools stems from the lack of clear
definitions of the constituents of class diagrams. Guéhéneuc therefore
systematically studied the constituents of class diagrams with reference to the
UML 1.5 specification and refined their definitions. Guéhéneuc [5] then
exemplified the study by using PTIDEJ to reverse-engineer Java programs into UML
diagrams abstractly and precisely.

We agree with Guéhéneuc’s assertion in [5] that the definitions of some
constituents of class diagrams are vague and verbose. The refined definitions in
[5, 40] therefore helped this thesis considerably in establishing the mappings
between mcAM constituents and cSC constructs.

Other authors have also pointed out the lack of abstraction in reverse
engineering tools/methods [35, 41, 39]. The diagrams generated by reverse
engineering methods/tools are often very cluttered [41]. Such diagrams are of
little help to software engineers in understanding the system’s architecture,
since it is hard to locate the elements of primary interest.

Regarding the condensation of reverse-engineered class diagrams to enhance their
comprehensibility, Osman et al. [42] proposed an approach based on a supervised
classification algorithm that takes design metrics (e.g., number of operations
and number of attributes) as input. Yet, a fundamental question left open is
which elements of the system’s architecture should be selected to accommodate
various levels of abstraction [39]. Thung et al. extended the work of Osman et al.
They