Manually Mapping Model Elements onto the Modeled Code by Analyzing GitHub Data

Master’s thesis in Computer Science and Engineering

WENLI ZHANG

Department of Computer Science and Engineering
CHALMERS UNIVERSITY OF TECHNOLOGY
UNIVERSITY OF GOTHENBURG
Gothenburg, Sweden 2023

© WENLI ZHANG, 2023.

Supervisor: Regina Hebig, Computer Science and Engineering
Examiner: Christian Berger, Computer Science and Engineering

Master’s Thesis 2023
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Typeset in LaTeX
Gothenburg, Sweden 2023

Abstract

Context: Class diagrams are one of the most popular UML models and are frequently used in the early stages of software development. They reflect design decisions and the system’s implementation structure, so maintainers can use them to understand that structure. Yet, if class diagrams are not updated as the code evolves, the code implementation deviates from the class diagram design. One concern is that such a divergent class diagram no longer helps maintainers in the same way during the maintenance stage. As a solution, reverse engineering methods/tools can recover class diagrams from code.
Yet another concern arises: in most cases, the reverse-engineered class diagrams are not abstract and contain so much information that they burden the understanding of the system’s implementation structure. This is because the existing reverse engineering methods/tools are imperfect: they do not manage to imitate the human ability to abstract relevant information from the source code. Surprisingly, existing studies on the characteristics of manual abstraction are based on the opinions and experiences of participants and do not study actual cases of models and source code. Also, the methods/technologies used for checking the similarities and differences between models and source code are purely structural and do not take the semantics of the model elements into account when mapping classes between models and code. The semantics is closely related to how abstractions are created. Therefore, a systematic manual study of the characteristics of manual abstraction is required.

Aim: To fill this gap, this thesis aims to study the characteristics of the differences between the class diagram design and the code implementation by manually creating mappings between the class diagram elements/constituents and the code constructs. Our manual studies can precisely capture the differences between the class diagram design and the source code implementation and investigate the causes of these differences.

Method: We conducted five case studies. The five subjects are Java open-source projects collected from GitHub, semi-randomly selected from the Lindholmen dataset [1].

Results: For the differences between the class diagram design and the code implementation, three causes are summarized: various levels of manual abstraction created in class diagrams, deviations of the code implementation from the class diagram design, and common changes between the class diagram elements/constituents and code constructs.
We contribute a sorted list of cases corresponding to these three causes.

Keywords: UML, Models in Open Source Systems, Reverse Engineering, Deviations between Code and Design, Manual Abstraction in Modeling

Acknowledgements

Approaching the end of my master’s degree, I would like to take this opportunity to express my gratitude to all my lecturers, former colleagues, friends, and family who helped and supported me through these two and a half years.

To my thesis supervisor, Dr. Regina Hebig: your all-around support and trust in my thesis work have been like the warmest sunshine in a dark Swedish winter. Your knowledge, inner drive, and kindness in helping me in every respect (not limited to academics) have shaped my attitude toward my future studies and life: being loyal and optimistic. I particularly remember the days when I was stuck on the definition of terminologies; you guided me patiently, even hand in hand, and taught and inspired me to find the best solution to the puzzle I was in at that time. The valuable and interesting learning experiences from your own time as a student, which you shared with me, will be an invaluable asset in my life.

To my examiner, Dr. Christian Berger: thank you for your valuable input on my thesis proposal and half-time presentation, which helped me think through the methodology of this thesis more carefully. To Dr. Richard Torkar, for whom I served as a teaching assistant for about six months: your enthusiasm for research and willingness to help students have encouraged me to become a person like you. To all of my lecturers, for instilling in me software engineering knowledge and cultivating my problem-solving and independent thinking abilities. To my former company mentor, Dr. Jianlin Shi, and my director, Yujie Xue, with whom I worked for half a year in 2021: your full support and care greatly boosted my confidence to pursue what I am passionate about (this thesis work) on campus.
To my friends, who cheered me up and always believed in me: you are all like true “family members”. We shared our interests and hobbies, patiently taught each other life tips, and explored new things together. This makes up a major part of my life and keeps learning and living in balance. To the most important people in my life, my parents, brother, and sister, who have always encouraged me to be brave and supported me in pursuing what I have been interested in since I was a child.

My great interest in modeling and software architecture supported me in finalizing this thesis. I am grateful to Dr. Regina Hebig again for providing me with this thesis topic, which helped me step into my area of interest and establish a framework for learning that can be extended in the future.

Without all of you, this thesis would not have been possible.

Wenli Zhang, Gothenburg, February 2023

Contents

List of Figures
List of Tables
List of Acronyms
1 Introduction
1.1 Background
1.2 Challenges
1.3 Goal and Motivation
1.4 Case Study Subjects
1.5 Definition of Terminologies
1.5.1 Examples of Mappings between mcAM Concepts and voSC
1.5.2 Ideal Selection of One cSC among Multiple cSCs of a mcAM
1.6 Research Questions
1.7 Contributions
1.8 Structure of the Paper
2 Theory and Related Work
2.1 Theory
2.1.1 Related Java Knowledge
2.1.1.1 Java SE7 Specifications
2.1.1.2 Object-Oriented (OO) Paradigm
2.1.2 Unified Modeling Language (UML)
2.1.2.1 Graphical Notation for Classifiers
2.1.2.2 Conversions of BNF
2.1.2.3 Textual Notation for Attributes
2.1.2.4 Textual Notation for Operations
2.1.2.5 Visibility
2.1.2.6 Multiplicity
2.1.2.7 Graphical Notation for Relationships
2.2 Related Work
2.2.1 Reverse Engineering
2.2.1.1 Manual Abstraction Created over Code
2.2.1.2 Solutions and Attempts to Provide Abstraction for Reverse-Engineered Class Diagrams
2.2.1.3 Consistency Check(s) between Code and Design
2.2.2 Models in Software Development Practice
2.2.2.1 Usage of Models
2.2.2.2 Dataset/Database of Models
3 Methodology
3.1 Open Source Repositories Access
3.2 Data Selection
3.2.1 Definition of Terminologies
3.2.2 Ideal Selection of One mcAM from a Project
3.2.3 Ideal Selection of One cSC among Multiple cSCs of that mcAM
3.2.3.1 Time and Resource Consumption
3.3 Automatic Reverse Engineering Tool Selection
3.4 Spreadsheet Comparison Template Design
3.4.1 Turning mcAM and cSC into Pictures for Differentiating
4 Results
4.1 Information on the Project Studied
4.2 Suspected Cases
4.3 Cases of the Differences Caused by MA and disAGTs
4.3.1 Cases of MA - Classes
4.3.1.1 Case MC-1 - Hierarchical inheritance structure not created in the mcAM is added to the cSC
4.3.1.2 Case MC-2 - One class created in the mcAM is divided into more than one class in the cSC (related to the specific design patterns)
4.3.2 Cases of MA - Attributes
4.3.2.1 Case MA-1 - Additional attributes not in the mcAM are added to the cSC
4.3.2.2 Case MA-2∗ - An attribute of classifier A whose type can indicate an aggregation or a composition between classifiers A and B from the cSC, is modeled out by the naming of the association between them in the mcAM
4.3.2.3 Case MA-3∗ - The additional attribute type not in the mcAM is added to the cSC
4.3.2.4 Case MA-4 - The additional default value not in the mcAM is added to the cSC
4.3.2.5 Case MA-5 - One or more common attributes in one or more subclasses in the mcAM are upshifted to the hierarchical structure in the cSC, i.e., the superclass inherited by these subclasses
4.3.2.6 Case MA-6 - One attribute created from the mcAM is divided into more than one attribute in the cSC
4.3.3 Cases of MA - Operations
4.3.3.1 Case MO-1 - Additional operations not in the mcAM are added to the cSC (three subcases)
4.3.3.2 Case MO-2 - Additional parameter names not in the mcAM are added to the cSC
4.3.3.3 Case MO-3 - Additional parameter types not in the mcAM are added to the cSC
4.3.3.4 Case MO-4 - Additional return types not in the mcAM are added to the cSC
4.3.3.5 Case MO-5∗ - Multiplicity (referring to the collection-related interfaces) specified for the return type without its corresponding implementation interfaces/classes specified in the mcAM, yet with them specified in the cSC
4.3.3.6 Case MO-6∗ - The default parameters of the operation (in Android) in the cSC are omitted in the mcAM
4.3.3.7 Case MO-7 - One operation created in the mcAM is divided into multiple operations in the cSC
4.3.4 Cases of MA - Relationships
4.3.4.1 Case MR-1 - Additional relationships not in the mcAM are added to the cSC (two subcases with their six and two corresponding concluded causes, respectively)
4.3.4.2 Case MR-2 - An aggregation between A (whole) and B (part) from the cSC is modeled as an association between A (origin) and B (target) in the mcAM (two causes)
4.3.4.3 Case MR-3 - A composition between A (whole) and B (part) from the cSC is modeled as an association between A (origin) and B (target) in the mcAM (one cause)
4.3.5 Cases of disAGTs - Classes
4.3.5.1 Case DC-1 - Hierarchical inheritance structures in the mcAM are removed in the cSC
4.3.6 Cases of disAGTs - Attributes
4.3.6.1 Case DA-1 - Attributes in the mcAM are removed in the cSC
4.3.6.2 Case DA-2∗ - Attributes from the mcAM are replaced with additional attributes (that are not in the mcAM added to the cSC) in the cSC
4.3.6.3 Case DA-3∗ - The attribute type in the mcAM and cSC is different (two causes)
4.3.6.4 Case DA-4∗ - Converting variables from the mcAM to constant variables in the cSC
4.3.6.5 Case DA-5∗ - The default value not in the mcAM is added to the cSC
4.3.7 Cases of disAGTs - Operations
4.3.7.1 Case DO-1 - Operations in the mcAM are removed in the cSC
4.3.7.2 Case DO-2 - The parameter names in the mcAM are removed in the cSC
4.3.7.3 Case DO-3 - The parameter types in the mcAM are removed in the cSC
4.3.7.4 Case DO-4∗ - The parameter type in the mcAM and cSC is different (one cause)
4.3.7.5 Case DO-5 - The return types in the mcAM are removed and as void in the cSC
4.3.7.6 Case DO-6∗ - One or more operations from classifier A in the mcAM are moved to classifier B in the cSC (related to the particular architectural patterns)
4.3.8 Cases of disAGTs - Relationships
4.3.8.1 Case DR-1 - Relationships between A and B from the mcAM are removed in the cSC (one cause)
4.3.8.2 Case DR-2 - A composition between A (whole) and B (part) from the mcAM is changed into an aggregation in the cSC (one cause)
4.3.8.3 Case DR-3∗ - Relationships between A and B from the mcAM are replaced with new relationships in the cSC (two causes)
4.4 Cases of the Differences Caused by CC
4.4.1 Cases of CC - Classes
4.4.1.1 Case CC-1 - Naming of classes in the mcAM and cSC is different (three causes)
4.4.2 Cases of CC - Attributes
4.4.2.1 Case CA-1 - Naming of attributes in the mcAM and cSC is different (two causes)
4.4.2.2 Case CA-2 - Naming of attribute types is different (two causes)
4.4.3 Cases of CC - Operations
4.4.3.1 Case CO-1 - Naming of operations in the mcAM and cSC is different (four causes)
4.4.3.2 Case CO-2 - Naming of parameters in the mcAM and cSC is different (two causes)
4.4.3.3 Case CO-3 - Naming of parameter types in the mcAM and cSC is different (one cause)
4.4.3.4 Case CO-4 - Naming of return types in the mcAM and cSC is different (one cause)
4.4.4 Ratios of Cases with Related Projects
4.4.5 Typical/Common Cases with Related Projects
4.4.6 Aggregated Results of the Cases for the Project
5 Discussion
5.1 Reflection on Data Selection
5.2 Limitations of Reverse Engineering Tools
5.3 Advantages of Manual Mappings
5.4 Guideline Enlightened by Cases
5.5 Related Technologies for Healing the Cases
5.6 Threats to Validity
5.6.1 Threats to Construction Validity
5.6.2 Threats to External Validity
5.6.3 Threats to Conclusion Validity
6 Conclusion and Future Work
6.1 Conclusion
6.2 Future Work
6.2.1 Covering Additional OOP Projects
6.2.2 Quantitative Analysis
Bibliography
A Corresponding mcAMs in the Five Projects Studied

List of Figures

1.1 Three challenges involved in the staged SDLC are likely to cause the implementation of the source code to deviate from the design of the class diagram; thereby, differences between the class diagram and the source code are introduced.
1.2 In the mcAM, for the concept, e.g., Dog, a second concept Pet is described by the superclass Pet of the class Dog that describes the concept Dog.
1.3 For the superclass Pet of the class Dog that describes a second concept Pet of a concept Dog in the mcAM, the class Pet in this voSC is considered to be mapped to that superclass Pet of the class Dog in the mcAM.
1.4 A class SinglePlayModel describes a concept SinglePlayerModel in the mcAM, yet the name of the class SinglePlayModel contains a misspelling.
1.5 The class SinglePlayerModel in this voSC has highly similar attributes and operations with the class SinglePlayModel in the mcAM and thus, the class SinglePlayModel is considered to be mapped to the class SinglePlayerModel in this voSC.
1.6 In the mcAM, a concept, i.e., decorators is described by a class ItemDecorator, which might imply the decorator design pattern would be applied in the voSC.
1.7 In this voSC, the concept decorators in the mcAM is described by the superclass Slot, and the subclasses LeftUpgradeSlot and RightUpgradeSlot derived from the superclass Slot. These three classes are the extension of the interface Icon embedded in the Java library.
This interface Icon is invoked in a class related to View, which is responsible for communicating with the subclass Upgrade derived from the superclass Item in the mcAM.
1.8 In the mcAM, for the concept, e.g., Dog, a second concept Pet is described by the superclass Pet of the class Dog that describes the concept Dog.
1.9 This voSC only includes one class Pet that can be mapped to the superclass Pet of the class, e.g., Dog in the mcAM.
1.10 A concept, i.e., InputFunction is described by a class InputFunction in the mcAM.
1.11 The concept InputFunction in the mcAM is described by a superclass InputFunction that is not in the mcAM but is added to the voSC.
1.12 In the mcAM, eight concepts are described by eight classes, i.e., BaseUI, BaseController, IncomeRegisterUI, RegisterIncomeController, Income, IncomeTyperRepository, CheckingAccount, and IncomeRepository, respectively.
1.13 A cSC of the mcAM covers all eight concepts in the mcAM.
1.14 Compared with the cSC illustrated in Figure 1.13, this cSC of the mcAM covers four more operations.
2.1 Three abstraction levels of graphical class notation: suppressed (on the top left corner), analysis (on the right), and implementation (on the bottom left corner) [21, p. 50]. Compared with the suppressed level with only the specified name of a class, another two levels present a more comprehensive layout of a class with (more or less) textual notation for attributes and operations embedded.
2.2 Specification of Backus-Naur Form (BNF) conversions [21, pp. 16–17].
2.3 Textual notation for attributes [21, pp. 129–130].
2.4 Textual notation for operations [21, pp. 107–108].
2.5 Syntax for a multiplicity string [21, p. 98].
2.6 Graphical notation for seven types of relationships.
2.7 Four abstraction levels specified for four relationships - dependency, association, aggregation, and composition, respectively.
3.1 The overall process of the methodology.
3.2 The selection process of one ideal cSC among multiple cSCs of the selected mcAM.
3.3 Record information about the project.
3.4 Record information of the mcAM.
3.5 Record information of the cSC.
3.6 Record the differences between the mcAM and the cSC of that mcAM.
3.7 Record additional notes.
3.8 For the class CheckingAccount, the attributes, e.g., income: Income is modeled in the mcAM.
3.9 For the class CheckingAccount, that attribute income: Income from the mcAM is replaced with fully new attributes incomeRepo: IncomeRepository in the cSC.
3.10 We recorded the difference, i.e., attributes from the mcAM are replaced with new attributes not in the mcAM yet added to the cSC, differentiated into disAGTs in our designed comparison template.
4.1 The subclasses, e.g., Max implemented in the cSC are hidden in the mcAM. Also, the superclass InputFunction inherited by these subclasses in the cSC is modeled as a class in the mcAM.
4.2 A hierarchical inheritance structure with the corresponding additional two subclasses not in the mcAM is added to the cSC.
4.3 In the mcAM, a concept, i.e., decorators is described by a class ItemDecorator, which might imply the decorator design pattern will be applied in the cSC.
4.4 In the cSC, the concept decorators in the mcAM is described by the superclass Slot, and the subclasses LeftUpgradeSlot and RightUpgradeSlot derived from the superclass Slot. These three classes are the extension of the interface Icon embedded in the Java library. This interface Icon is invoked within a View class, which is responsible for communicating with the subclass Upgrade derived from the superclass Item in the mcAM.
4.5 The naming of the association between the classes LaborBilling and PhaseLabor is specified with the attribute name lb in the cSC.
4.6 In the cSC, the attribute name lb indeed exists in the class PhaseLabor.
4.7 In the mcAM, for the attributes, e.g., email, a default value is not assigned.
4.8 A default value “ ” not in the mcAM is added to the attribute email.
4.9 In the mcAM, for the subclasses Food and Upgrade, there is a common attribute image.
4.10 The common attribute image of the subclasses Food and Upgrade in the mcAM is upshifted to their superclass Item in the cSC.
4.11 Three subcases correspond to this case with their corresponding causes.
4.12 Case MR-1 with the corresponding subcases and causes.
4.13 Cases MR-2 and MR-3, with the corresponding causes.
4.14 In the mcAM, no association is modeled between the classes LaborBilling and Pattern.
4.15 In the cSC, an operation getId() of LaborBilling is invoked in an operation getPhaseLabor(LaborBilling, Phase): PhaseLabor of Pattern.
Yet, the parameter types of the latter operation, e.g., LaborBilling, are not modeled out in the mcAM (as shown in Figure 4.14).
4.16 In the mcAM, no relationship is modeled between the classes BaseController and CheckingAccount. Plus, the return type CheckingAccount modeled in BaseController in the mcAM can partially imply an association to exist in the cSC.
4.17 In the cSC, an instance of CheckingAccount is created within the operation buildCheckingAccount(): CheckingAccount of BaseController, further being returned to this operation.
4.18 In the mcAM, no relationship is modeled between the classes ItemVisitor and Food. Yet, an instance of Food, i.e., is specified as a parameter type of the operation named visit of ItemVisitor in the mcAM.
4.19 In the cSC, for Food, its operation getPrice(): int inherited from the superclass Item is indeed invoked within visit(Food): void of ItemVisitor in the cSC.
4.20 In the mcAM, no relationship is modeled between the classes SinglePlayer and TitlePage.
4.21 In the cSC, an instance of TitlePage is created within an operation pausedMainMenu(View view): void of SinglePlayer.
4.22 In the mcAM, no relationship is modeled between the classes UserBuilder and Control.
4.23 In the cSC, an instance named userBuilder of UserBuilder is created within the operation newUser(int, String, String, String) of Control. Plus, an operation of userBuilder, e.g., setEmail(): String, is further invoked with the same operation of Control (in the mcAM)/RaiseMeUp (in the cSC).
4.24 In the mcAM, no relationship is modeled between the classes Food and Control.
4.25 In the cSC, an instance of Food (named f) is specified as a parameter type of an additional operation removeFood(Food): boolean of Control/RaiseMeUp. This instance of Food is used in another operation delFood(f): boolean of another class Dao. Plus, in Dao, an operation of Food, e.g., getName(), is further invoked with that delFood(Food): boolean.
4.26 In the mcAM, no relationship is modeled between the classes Neuron and NeuralNetwork.
4.27 In the cSC, an additional attribute inputNeurons: NeurophArrayList not in the mcAM is added to the class NeuralNetwork. The operation size() of Neuron is further invoked within the operation getInputsCount(): int of NeuralNetwork. Plus, the instances of Neuron are contained by not only NeuralNetwork but also by Layer in the cSC.
4.28 In the mcAM, no relationship is modeled between the classes Job and Dao.
4.29 In the cSC, an additional attribute jobs: Map not in the mcAM is added to the class DAO and it is further created in Dao body. An operation put() (an API) is further invoked in the operation getJob(): Map of Dao with the instance jobs. However, the instances of Job are contained by not only DAO but also by another class Pet in the cSC.
4.30 The aggregation between Employee and Schedule from the cSC is modeled as an association in the mcAM. Plus, the naming of this association in the mcAM is specified by an attribute name (whose related type is specified by the instances of Schedule) of Employee from the cSC.
4.31 In the cSC, an attribute type ArrayList contained in the class Employee means a group of instances of the class Schedule declared in Employee. These instances of Schedule are further created within the operation Employee(Integer, String, String) of Employee in the cSC. However, the instances of the class Schedule are contained not only by the instances of Employee but also by the instances of another class MainClass in the cSC.
4.32 An aggregation between Income and IncomeRepository from the cSC is partially indicated by the attribute type (instances of Income) of IncomeRepository, i.e., List, in the mcAM.
4.33 In the cSC, the type of the attribute named listIncome is specified by a collection of instances of Income, i.e., ArrayList in IncomeRepository. This collection of instances of Income is further created within an operation IncomeRepository() of IncomeRepository. Plus, the instances of Income are invoked within an operation save(Income): void of IncomeRepository in the cSC. Yet, the instances of Income are contained by not only IncomeRepository but also another class RegisterIncomeController in the cSC.
4.34 A composition between the classes SinglePlayer and SinglePlayModel from the cSC is modeled as an association in the mcAM.
4.35 In the cSC, an instance model of the class SinglePlayerModel that is modeled out as an attribute type in the class SinglePlayer in the mcAM is indeed declared as an attribute type of the class SinglePlayer in the cSC. This created instance model of SinglePlayerModel is further invoked within an operation of SinglePlayerModel, e.g., onKeyDown(int, KeyEvent): boolean. Plus, in the cSC, this instance model is exclusive to the corresponding instances of SinglePlayer.
4.36 A hierarchical inheritance structure in the mcAM.
4.37 The hierarchical inheritance structure in the mcAM is removed in the cSC. Only the superclass Pet from the mcAM remains, yet it is changed into a class in the cSC.
4.38 For the class CheckingAccount, the attributes income: Income and amount: BigDecimal are modeled in the mcAM.
4.39 For the class CheckingAccount, all its attributes in the mcAM are replaced with fully new attributes incomeRepo: IncomeRepository and expenseRepo: ExpenseRepository in the cSC.
4.40 Two causes for case DA-3∗.
4.41 In the mcAM, for the attribute named owner, User (a non-primitive data type) is specified. Also, for its setter operation named setOwner(), a parameter type User is specified.
4.42 In the cSC, the type User specified in the mcAM for the attribute named owner is changed to int. Accordingly, for the setter operation named setOwner() of that attribute, its parameter type is changed from User to int.
4.43 In the mcAM, for the class DAO, the operations, e.g., listUser(): Map are created.
4.44 In the cSC, the operation, e.g., listUser(): Map created in the class DAO from the mcAM is moved to another class RaiseMeUp/Controll as listUsers(): Map. Then it is used for getting a list of user data from the Model-related class Dao in the cSC.
4.45 Cases DR-1, DR-2, and DR-3∗, with the corresponding causes.
4.46 In the mcAM, a composition is created between the classes Pet and User.
4.47 In the cSC, the attribute whose type is specified by an instance of User in the mcAM is converted to a primitive data type int.
This leads to the corresponding composition between Pet and User from the mcAM being removed in the cSC. . . . . . . . . . . . . . . . 94 4.48 In the mcAM, a composition is modeled between the classes Layer and Neuron. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 4.49 In the cSC, the instances of Neuron are contained by not only Layer but also NeuralNetWork. . . . . . . . . . . . . . . . . . . . . . . . . . 95 4.50 In the mcAM, an association is created between CheckingAccount and IncomeRepository. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 4.51 For CheckingAccount, the attributes from the mcAM, e.g., income: Income, are replaced with fully new attributes, e.g., incomeRepo: In- comeRepository in the cSC. The specified attribute type, an instance of IncomeRepository, is created within an operation CheckingAccount() of CheckingAccount in the cSC. This instance is further invoked within another operation add(Income): void of CheckingAccount in the cSC. Furthermore, the instances of IncomeRepository are contained not only by CheckingAccount but also by another class ValuesCalculator in the cSC. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 4.52 In the mcAMA, an association is created between Pet (origin) PetO- bserver (target). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 4.53 RaseiMeUp (target) is linked with Pet (origin); RaseiMeUp (origin) is linked with PetObserver (target). Thus, an indirect association between Pet (origin) PetObserver (target) is built up. . . . . . . . . . 98 4.54 Causes for the cases caused by CC. . . . . . . . . . . . . . . . . . . . 99 A.1 Selected mcAM included in project 1. . . . . . . . . . . . . . . . . . . I A.2 Selected mcAM included in project 2. . . . . . . . . . . . . . . . . . . II A.3 Selected mcAM included in project 3. . . . . . . . . . . . . . . . . . . III A.4 Selected mcAM included in project 4. . . . . . . . . . . . . . . . . . . 
A.5 Selected mcAM included in project 5.

List of Tables

4.1 The project background (∼ = around).
4.2 Suspected cases caused by MA and disAGTs potentially exist in other voSC(s)/cSC(s) not selected by us previously.
4.3 Cases for the differences caused by MA and disAGTs (∗ = own case, <> = opposite).
4.4 The corresponding example of the case MA-1.
4.5 The corresponding example of the case MA-3∗.
4.6 The corresponding example of the case MA-6.
4.7 Corresponding examples of the causes for the three subcases of case MO-1.
4.8 Six corresponding examples of the cases titled MO-2, MO-3, MO-4, MO-5∗, MO-6∗, and MO-7.
4.9 The corresponding example of the case DA-1.
4.10 Two corresponding examples of the causes for case DA-3∗.
4.11 The corresponding example of the cases DA-4∗ and DA-5∗.
4.12 Four corresponding examples of the cases DO-1, DO-2, DO-3 and DO-5.
4.13 Seven cases of the differences caused by CC.
4.14 Three corresponding examples of the causes for case CC-1.
4.15 Two corresponding examples of the causes for case CA-1.
4.16 Two corresponding examples of the causes for case CA-2.
4.17 Four corresponding examples of the causes for case CO-1.
4.18 Two corresponding examples of the causes for case CO-2.
4.19 The corresponding example of the cause for case CO-3.
4.20 Two corresponding examples of the cause for case CO-4 (∗ = particular interest).
4.21 Ratios of the cases with involved projects (∗ = own case).
4.22 Typical/common cases with related projects.
4.23 Project 1 - Respective differentiated cases of MA, disAGTs, and CC (∗ = own case).
4.24 Project 2 - Respective differentiated cases of MA, disAGTs, and CC (∗ = own case).
4.25 Project 3 - Respective differentiated cases of MA, disAGTs, and CC (∗ = own case).
4.26 Project 4 - Respective differentiated cases of MA, disAGTs, and CC (∗ = own case).
4.27 Project 5 - Respective differentiated cases of MA, disAGTs, and CC.

List of Acronyms

Uncommon acronyms
mcAMs Manually created architectural models
voSC Version(s) of source code
cSC Conformable source code
MA Manual abstraction
disAGTs Disagreements
CC Common changes

General acronyms
SDLC Software Development Life Cycle
OOP Object-Oriented Programming
UML Unified Modeling Language
FOSS Free/Open Source Software
API Application Programming Interface
EA Enterprise Architect
IDEA IntelliJ IDEA
AST Abstract Syntax Tree
MVC Model-View-Controller
MVVM Model-View-ViewModel
OCL Object Constraint Language
MDE Model-Driven Engineering

1 Introduction

1.1 Background

The software development life cycle (SDLC) is defined by Stocia et al. [2] as “an environment that describes the activities performed in each stage of the software development process.” In the early stages of the SDLC, the Unified Modeling Language (UML) is widely used to model and visualize system artifacts [3]. Among the various models of UML, the class diagram is the most commonly used and important diagram in object-oriented system modeling [4].
The advantage of using class diagrams in practice is that they help software engineers (both developers and maintainers) understand system architectures, behaviors, design choices, and implementations [5]. Thus, it is easier for developers and maintainers to understand the structure of the system by looking at the class diagram than by reading through the code in detail.

1.2 Challenges

Class diagrams model the information on the domain of interest in terms of objects organized in classes and the relationships between them [6]. They are used intensively in the early stages of the SDLC to present the system’s structure. Maintainers benefit from using class diagrams to understand the system’s structure and thus to locate the places that need to be modified [7]. However, other authors have revealed three challenges that are likely to cause the implementation of the source code to deviate from the design of the class diagram. The concern is that such divergent class diagrams cannot help developers and maintainers understand the structure of the system in the same way.

These three challenges affect the activities performed by software engineers in different stages of the SDLC. Note that the stages of the SDLC vary depending on the source [2, 8, 9]. Figure 1.1 illustrates five of the stages: analysis, design, implementation, testing, and maintenance. This thesis focuses on the three stages in which the three challenges arise: design, implementation, and maintenance, respectively (the stages filled in blue in Figure 1.1). Architects and their teams, developers, and maintainers are involved in each of these three stages. A detailed description of the three challenges depicted in Figure 1.1 is as follows:

• Challenge 1 - Creating various levels of manual abstraction on the elements of the class diagram during the design stage: Osman [p. 45, 10] proposed that class diagrams with a low level of detail are used to show a high-level abstraction of the structure of the system. However, little is known about which level of abstraction the architects and their teams create during the design phase.

• Challenge 2 - Missing or unfollowed parts of the class diagram during the implementation stage: Guéhéneuc [5] proposed that class diagrams produced during the design stage are often forgotten during the implementation stage, usually under time pressure. Truong et al. [11] found that many created designs are only partially followed during implementation. Thus, missing or unfollowed parts of the class diagram will cause the implementation of the source code to deviate from the design of the class diagram.

• Challenge 3 - Missing updates of the class diagram during the maintenance stage as the code evolves: Osman et al. [12] proposed that the frequency of updating UML models is low, whereas each new feature introduced in a new version/release of the system should result in an update of the class diagram. We agree with the assertion by Osman et al. [13] that keeping diagrams up to date with code evolution is often a challenge. Figure 1.1 illustrates that code evolves over time from implementation to maintenance. However, if the class diagram remains static (without updates) as the code evolves, it does not reflect the new features introduced to the system.

Figure 1.1: Three challenges involved in the staged SDLC are likely to cause the implementation of the source code to deviate from the design of the class diagram; thereby, differences between the class diagram and the source code are introduced.

1.3 Goal and Motivation

As mentioned above, divergent class diagrams cannot be used in the same way by software engineers for understanding the system’s implementation structure. As a solution, reverse engineering methods/tools can reverse code into class diagrams.
Yet, in most cases, the reverse-engineered class diagrams are not abstract and contain extensive information that burdens software engineers’ understanding of the system’s implementation structure. This is because reverse engineering tools/methods cannot be given the characteristics of manual abstraction as input. Thus, the tools/methods do not manage to imitate how humans abstract only the relevant information. Yet, the characteristics of manual abstraction can only be obtained through flexible manual studies, because humans can jointly interpret the semantics conveyed by different model elements based on a full understanding of the relevant code implementation. Thus, the goal of this thesis is to manually discover the characteristics of the manual abstraction created in the model elements.

This goal is motivated by the following observations:

No actual case of models and source code has been studied in terms of manual abstraction characteristics: As far as we know, the existing studies on the characteristics of manual abstraction are based on the opinions and experiences of the participants and do not study actual cases of models and source code.

The consistency checks of the differences and similarities between the design and code are purely structural and do not take the semantics conveyed by model elements into account. Yet, the semantics conveyed by model elements is key for studying the manual abstraction characteristics. Given that different systems have their own specific implementation structures, the desired functionalities require code structures that are interrelated and cooperate, while also taking into account the application of specific architectural and design patterns. These factors need to be considered jointly, which can only be achieved by flexible manual studies of models and source code.
1.4 Case Study Subjects

In order to investigate the characteristics of the differences between the source code and the class diagram, we employed the methodology of five case studies. This requires access to a dataset that includes a set of projects with class diagrams and the corresponding source code. However, such datasets are rare and difficult to access, since industrial models and source code are often not accessible for research. To address this issue, we decided to make use of models used in Free/Open Source Software (FOSS) projects. Thus, we used the Lindholmen Dataset [1], created by Hebig et al., for this thesis work.

The Lindholmen Dataset [1] holds 3 295 open-source GitHub projects, which together include 21 316 UML models. The model files are in two formats (images and standard files). We first selected five projects written in the Java programming language from that dataset as our study subjects. We then limited the selected class diagrams included in the projects to those in image format, and we refer to them as manually created architectural models (mcAMs). The model elements we studied are classes, attributes, operations, and relationships (i.e., dependencies, usages, associations, aggregations, compositions, inheritances, and realizations).

1.5 Definition of Terminologies

We defined the following terminologies, or used definitions given by others, to enable the reader to understand the research questions (RQs) formulated in section 1.6.

As defined in the source [14], a commit is a snapshot of the changes made to the staging area, which holds the files to be included in the next commit.

Version(s) of source code (voSC) is represented by the collection of source code files found in the repository after a commit (and before the next commit).

Concepts are described by classes created in the mcAM. To be specific, a concept can be described by either an abstract (super)class or a non-abstract (normal) class.
Note that, regarding the definition of concepts, one can argue that concepts can be described by both the classes and the relationships between these classes in the mcAM. However, we argue that concepts are described only by classes from the mcAM. This is because differences in the attributes and operations of the classes between the mcAM and the source code lead to differences in relationships. These correlated differences are what we aim to study.

Map to a voSC refers to mapping one or more concepts described in the mcAM to one or more classes of that voSC. Note that human judgment is involved in the mapping, since the naming of classes in the mcAM and in the source code might differ.

A voSC v is conformable to the mcAM if for every concept described in the mcAM the following holds: for the concept a there is a Map to the voSC v, or there is a second concept b in the mcAM described by the superclass of the class that describes the concept a, and for this second concept b there is a Map to that voSC v. The latter case is illustrated by the following example:

Example: This example is taken from the repository of one of our study projects, i.e., ZooTypers [15] on GitHub. As seen in Figure 1.2, there are five concepts in the mcAM: Pet, Dog, Cat, Fish, and Penguin, which are described by the superclass Pet and the four subclasses Dog, Cat, Fish, and Penguin derived from that superclass, respectively. Figure 1.3 illustrates that a voSC that only includes a class Pet, which can be mapped to the superclass Pet of the class, e.g., Dog in the mcAM, is conformable to the mcAM: for the concept, e.g., Dog, there is a second concept Pet described by the superclass Pet of the class Dog that describes the concept Dog, and for this second concept Pet, there is a Map to that voSC.

Figure 1.2: In the mcAM, for the concept, e.g., Dog, there is a second concept Pet described by the superclass Pet of the class Dog that describes the concept Dog.
Figure 1.3: For the superclass Pet of the class Dog, which describes a second concept Pet of the concept Dog in the mcAM, the class Pet in this voSC is considered to be mapped to that superclass Pet of the class Dog in the mcAM.

Conformable source code (cSC) refers to a voSC that is conformable to the mcAM.

1.5.1 Examples of Mappings between mcAM Concepts and voSC

To illustrate how mappings are created between mcAM concepts and a voSC, the following examples are given.

Note that, in the following examples, for a concept a, the name of the class b that describes the concept a in the mcAM and the name of the mapped class c in the voSC v might differ. However, class b and class c are ontologically identical, since they describe the same concept with highly similar attributes and operations; thereby, class b in the mcAM is considered to be mapped to class c in the voSC v.

Case 1 - A class that describes a concept in the mcAM can be mapped to one class in the voSC.

Example of Case 1: This example is taken from the repository of one of our study projects, i.e., ZooTypers [15] on GitHub. As observed in the comparison between Figure 1.4 and Figure 1.5, the name of the class in the mcAM, i.e., SinglePlayModel, differs from the name of the class in the voSC, i.e., SinglePlayerModel. This might be due to a misspelling of the name of the class SinglePlayModel in the mcAM. However, the class SinglePlayModel in the mcAM and the class SinglePlayerModel in the voSC are ontologically identical, since they have highly similar attributes and operations; thereby, these two classes are considered to describe the same concept SinglePlayerModel. Therefore, the class SinglePlayModel in the mcAM is considered to be mapped to the class SinglePlayerModel in the voSC.

Figure 1.4: A class SinglePlayModel describes a concept SinglePlayerModel in the mcAM, yet the name of the class SinglePlayModel contains a misspelling.
Figure 1.5: The class SinglePlayerModel in this voSC has attributes and operations highly similar to those of the class SinglePlayModel in the mcAM; thus, the class SinglePlayModel is considered to be mapped to the class SinglePlayerModel in this voSC.

Case 2 - A class that describes a concept in the mcAM can be mapped to more than one class in the voSC.

Example of Case 2: This example is taken from the repository of one of our study projects, i.e., RaiseMeUp [16] on GitHub. Figure 1.6 illustrates that a concept decorators is described by a class ItemDecorator in the mcAM. Decorators are part of the decorator design pattern [17]. Thereby, the naming of the class ItemDecorator and the relationships between ItemDecorator and the subclass Upgrade derived from the superclass Item possibly imply that the decorator design pattern is applied in this project. One or more decorators of the decorator design pattern might be further planned to decorate the subclass Upgrade derived from the superclass Item in the mcAM.

To confirm whether the decorator design pattern is applied in the voSC and whether the concept decorators created in the mcAM remains in the voSC, we checked the detailed implementation of the voSC illustrated in Figure 1.7. This showed that the classes Slot, LeftUpgradeSlot, and RightUpgradeSlot are indeed the classes related to decorators of the decorator design pattern. Also, these classes are extensions of the interface Icon from the Java library. Note that an MVC architectural pattern is adopted in the voSC. On this basis, in the voSC, the subclass Upgrade derived from the superclass Item is a Model-related class, and the interface Icon is invoked in a View-related class that is responsible for communicating with that subclass Upgrade.
Therefore, these three classes in the voSC are considered to be related to decorators and are considered to be mapped to the class ItemDecorator in the mcAM, since the concept decorators indeed remains in the voSC and is described by those three classes.

Figure 1.6: In the mcAM, a concept, i.e., decorators, is described by a class ItemDecorator, which might imply that the decorator design pattern would be applied in the voSC.

Figure 1.7: In this voSC, the concept decorators in the mcAM is described by the superclass Slot and the subclasses LeftUpgradeSlot and RightUpgradeSlot derived from the superclass Slot. These three classes are extensions of the interface Icon from the Java library. This interface Icon is invoked in a class related to View, which is responsible for communicating with the subclass Upgrade derived from the superclass Item in the mcAM.

Case 3 - In the mcAM, for concept a, there is a second concept b described by the superclass of the class that describes the concept a, and for this second concept b, that superclass in the mcAM can be mapped to one class in the voSC.

Example of Case 3: This example is taken from the repository of one of our study projects, i.e., RaiseMeUp [16] on GitHub. Figure 1.8 illustrates that in the mcAM, for the concept, e.g., Dog, a second concept Pet is described by the superclass Pet of the class that describes the concept Dog. For the second concept Pet, there is a Map to this voSC, i.e., mapping that superclass Pet in the mcAM to one class Pet in the voSC (see Figure 1.9).

Figure 1.8: In the mcAM, for the concept, e.g., Dog, a second concept Pet is described by the superclass Pet of the class Dog that describes the concept Dog.

Figure 1.9: This voSC only includes one class Pet that can be mapped to the superclass Pet of the class, e.g., Dog in the mcAM.
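The conformability condition from section 1.5, and Case 3 in particular, can be sketched in Java as follows. This is only an illustrative sketch and not part of the manual method used in this thesis: the class name ConformabilityCheck, its method names, and the string-based data representation are assumptions made for the example. The mapping itself is taken as a given input, because creating it requires human judgment, and only the direct superclass in the mcAM is consulted, as in the definition.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper; all names here are illustrative assumptions.
final class ConformabilityCheck {

    // True if the concept has at least one manually assigned voSC class.
    static boolean isMapped(String concept, Map<String, Set<String>> mapping) {
        Set<String> targets = mapping.get(concept);
        return targets != null && !targets.isEmpty();
    }

    // A voSC is conformable if every mcAM concept is mapped directly, or the
    // concept described by its direct mcAM superclass is mapped.
    static boolean isConformable(Set<String> concepts,
                                 Map<String, String> superclassOf,
                                 Map<String, Set<String>> mapping) {
        for (String concept : concepts) {
            String parent = superclassOf.get(concept);
            boolean direct = isMapped(concept, mapping);
            boolean viaSuperclass = parent != null && isMapped(parent, mapping);
            if (!direct && !viaSuperclass) {
                return false;
            }
        }
        return true;
    }

    // The Pet example: five mcAM concepts, while the voSC only contains Pet.
    static boolean petExampleIsConformable() {
        Set<String> concepts =
                new HashSet<>(Arrays.asList("Pet", "Dog", "Cat", "Fish", "Penguin"));
        Map<String, String> superclassOf = new HashMap<>();
        for (String sub : Arrays.asList("Dog", "Cat", "Fish", "Penguin")) {
            superclassOf.put(sub, "Pet");
        }
        Map<String, Set<String>> mapping = new HashMap<>();
        mapping.put("Pet", Collections.singleton("Pet")); // manual Map to the voSC
        return isConformable(concepts, superclassOf, mapping);
    }

    public static void main(String[] args) {
        System.out.println(petExampleIsConformable());
    }
}
```

In this sketch, the voSC that contains only the class Pet is reported as conformable, because each of Dog, Cat, Fish, and Penguin is covered through the mapped superclass concept Pet.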
Case 4 - A class that describes a concept a in the mcAM can be mapped to an additional superclass (not in the mcAM, but added to the voSC) that describes the concept a in the voSC. Here we ignore that one or more additional subclasses derived from that superclass (likewise not in the mcAM, but added to the voSC) describe one or more additional concepts (not in the mcAM, but added to the voSC).

Example of Case 4: This example is taken from the repository of one of our study projects, i.e., Neuroph [18] on GitHub. Figure 1.10 illustrates that a concept, i.e., InputFunction, is described by a class InputFunction in the mcAM. Figure 1.11 illustrates that this concept InputFunction remains in the voSC and is described by one additional superclass InputFunction (not in the mcAM, but added to the voSC). Therefore, the class InputFunction in the mcAM is considered to be mapped to that superclass InputFunction in the voSC.

As observed in the comparison between Figure 1.10 and Figure 1.11, an additional concept, e.g., Max, which is not in the mcAM, is added to the voSC and is described by an additional subclass Max (not in the mcAM, but added to the voSC). For this additional concept Max, there is a second concept InputFunction described by the superclass InputFunction of the class Max that describes the additional concept Max in the voSC, and for this second concept InputFunction, the superclass InputFunction is considered to be mapped to the class InputFunction in the mcAM.

Figure 1.10: A concept, i.e., InputFunction, is described by a class InputFunction in the mcAM.

Figure 1.11: The concept InputFunction in the mcAM is described by a superclass InputFunction that is not in the mcAM but added to the voSC.

1.5.2 Ideal Selection of One cSC among Multiple cSCs of a mcAM

In a project’s GitHub repository, commits are made throughout the SDLC as the code evolves. Thus, a project has different voSCs as the code evolves.
A mcAM may have one or more cSCs, i.e., one or more voSCs that are conformable to the mcAM. However, given the time constraints and our wish to study more projects, we decided to select only one of these cSCs. Compared with the other cSCs of the mcAM, this cSC should ideally cover the most attributes and operations associated with the concepts in the mcAM. However, this cannot be guaranteed, since it is not possible for us to check the voSCs one by one (see the detailed methodology described in section 3.2). This leads to a threat, which is discussed in section 5.6.

To illustrate how one cSC is selected among multiple cSCs of a mcAM, the following example is given:

This example is taken from the repository of one of our study projects, i.e., EAPLI_PL_2NB [19] on GitHub. Figure 1.12 illustrates that in the mcAM there are eight concepts, which are described by eight classes, i.e., BaseUI, BaseController, IncomeRegisterUI, RegisterIncomeController, Income, IncomeTyperRepository, CheckingAccount, and IncomeRepository, respectively.

Figure 1.12: In the mcAM, eight concepts are described by eight classes, i.e., BaseUI, BaseController, IncomeRegisterUI, RegisterIncomeController, Income, IncomeTyperRepository, CheckingAccount, and IncomeRepository, respectively.

Below are two examples of cSCs of the mcAM depicted in Figure 1.12. As observed in the comparison between Figure 1.13 and Figure 1.14, both cSCs cover all eight concepts in the mcAM. The only difference between them is that the cSC represented by Figure 1.14 covers four more operations (underlined in blue) than the cSC represented by Figure 1.13. Thus, we would ideally want to select the cSC illustrated in Figure 1.14.

Figure 1.13: A cSC of the mcAM covers all eight concepts in the mcAM.

Figure 1.14: Compared with the cSC illustrated in Figure 1.13, this cSC of the mcAM covers four more operations.
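The ideal selection described above can be sketched in Java as follows. This is a simplified illustration under stated assumptions, not the procedure actually applied in this thesis: the class CscSelection, the commit labels commit-A and commit-B, and the member lists are hypothetical, and members are matched by exact name, whereas the manual study relies on human judgment of semantics.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper; all names and data here are illustrative assumptions.
final class CscSelection {

    // One conformable version of the source code, identified by a commit label.
    static final class Candidate {
        final String commit;
        final Set<String> members; // qualified attribute/operation names in this cSC

        Candidate(String commit, String... members) {
            this.commit = commit;
            this.members = new HashSet<>(Arrays.asList(members));
        }
    }

    // Number of mcAM attributes/operations that also occur in the candidate.
    static int coverage(Set<String> mcamMembers, Candidate candidate) {
        int covered = 0;
        for (String member : mcamMembers) {
            if (candidate.members.contains(member)) {
                covered++;
            }
        }
        return covered;
    }

    // Returns the candidate with the highest coverage; the first one wins ties.
    // Returns null if the list of candidates is empty.
    static Candidate selectBest(Set<String> mcamMembers, List<Candidate> candidates) {
        Candidate best = null;
        int bestCoverage = -1;
        for (Candidate candidate : candidates) {
            int current = coverage(mcamMembers, candidate);
            if (current > bestCoverage) {
                best = candidate;
                bestCoverage = current;
            }
        }
        return best;
    }

    // Illustrative data only: two cSCs, the second covering one more operation.
    static String demo() {
        Set<String> mcamMembers = new HashSet<>(Arrays.asList(
                "IncomeRepository.save(Income)",
                "CheckingAccount.add(Income)"));
        List<Candidate> candidates = Arrays.asList(
                new Candidate("commit-A", "IncomeRepository.save(Income)"),
                new Candidate("commit-B", "IncomeRepository.save(Income)",
                        "CheckingAccount.add(Income)"));
        return selectBest(mcamMembers, candidates).commit;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

With the illustrative data above, the sketch selects commit-B, the candidate that covers more of the operations planned in the mcAM. This mirrors the preference for the cSC in Figure 1.14 over the one in Figure 1.13.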
1.6 Research Questions

To reach the goal of this thesis of studying the characteristics of manual abstraction, we formulated the following research questions:

RQ1: Does the cSC of the mcAM cover all elements planned out in that mcAM?
As mentioned in section 1.5, if a mcAM has multiple cSCs, we select only one of them. It is possible that the selected cSC does not cover all elements (i.e., attributes and operations) planned out in that mcAM. This causes differences between the cSC of that mcAM and that mcAM.

RQ2: What causes the differences between the cSC of the mcAM and that mcAM?
With the answer to RQ2, we can get a list of cases that cause the differences between the cSC of the mcAM and that mcAM. Could these cases be categorized into some common causes?

RQ3: What are the differences between the cSC of the mcAM and that mcAM?
With the answer to RQ3, we can conclude some common causes of the differences between the cSC of the mcAM and that mcAM.

1.7 Contributions

The results of this thesis provide a sorted list of cases that cause the cSC to deviate from the mcAM and a sorted list of suspected cases inferred from these observed cases. These suspected cases are considered to possibly exist and would also lead to differences between the mcAM and the cSC. Accordingly, this thesis has the following implications for the reverse engineering and modeling community:

1. To provide input for future improvement of reverse engineering methods/tools in terms of abstractness: As far as we know, reverse engineering is imperfect, as it does not manage to imitate the human ability to abstract relevant information from the source code. Guéhéneuc [5] proposed that no existing mainstream reverse engineering tool produces abstract yet precise class diagrams. The concluded cases of creating various levels of manual abstraction on the model elements can be used as such input.

2. To provide input for developing mapping rules which can be used for the consistency check(s) between the model design and code implementation: Existing methods/technologies developed for consistency checks between code and design are purely structural and do not take the semantics conveyed by the model elements into account. Yet, the semantics is closely related to the manual abstraction characteristics. Thereby, the sorted list of cases of the differences between the code and design, which resulted from manually studying five Java projects based on interpreting the semantics of the model elements, can provide such input.

3. To provide a guideline for designing model elements to avoid over-abstraction and over-specification: Given the abstract nature of the model, the model elements can be modeled at various levels depending on the design decisions made by architects. When it comes to design decisions for creating different elements of the class diagram during the design stage, little is known about which design decisions tend to be accepted or rejected by developers in the code implementation. Rejection can result from over-specifying the model elements and thereby losing abstractness. In that case, the developers disagree with the design decisions made by architects and make different decisions in the code implementation. On the other hand, rejection can also result from over-abstracting the model elements. This leads to vague design decisions that cannot be accepted by developers in the code implementation, given that the developers need to settle these vague design decisions and further specify the detailed implementation for them. Deviations of the code from the design then arise. Thus, the concluded cases of the differences caused by developers’ deviations from the architects’ design decisions allow us to create this guideline.
1.8 Structure of the Paper

This thesis presents a systematic manual study of the characteristics of the differences between the mcAM and one cSC of that mcAM by analyzing five open-source Java projects on GitHub. The structure of this thesis is outlined as follows: Chapter 2 describes the relevant theoretical knowledge and earlier research done by others. Chapter 3 details the methodology of the five case studies employed. The results of this thesis work are presented in Chapter 4. The threats to the validity of this thesis are described and discussed in Chapter 5. Chapter 6 concludes this thesis work and suggests future work.

2 Theory and Related Work

In this chapter, in order to help the reader understand the work of this thesis, the relevant theoretical knowledge of Java and UML is first described. This lays the ground for understanding the constituents of a mcAM, so that every constituent in that mcAM can be mapped to the corresponding constructs of one cSC of that mcAM. After that, related earlier work by other authors on reverse engineering and on models in open-source systems is presented and discussed. The former work provided the inspiration for this thesis and is where this thesis originated. The study subjects of this thesis rely on the outcomes of the latter work.

2.1 Theory

In order to detect the differences between the mcAM and the cSC of that mcAM, for each model constituent in the mcAM, there must be a map to the corresponding construct(s) of the cSC of that mcAM. Only with knowledge of Java and UML can one understand how to map every element in a mcAM to the corresponding construct(s) of a cSC of that mcAM.

As Java and UML evolve, multiple versions of their specifications exist at different times. On the other hand, the mcAM and the first voSC found in the repository of each project were created at different times.
For the five projects studied, in order to ensure that the mcAMs and voSCs included in the projects matched the appropriate versions of the Java and UML specifications, respectively, it is critical to identify when the earliest mcAM and voSC were created in the repository. This is because the release dates of the Java and UML specifications used should ideally be as close as possible to the creation dates of the earliest mcAM and voSC; later versions contain updates that may not apply to these mcAMs and voSCs. After checking, the earliest mcAM and voSC were created on July 8, 2011, and August 24, 2011, respectively. Consequently, the Java SE7 specifications (released in July 2011) [20] and the UML v2.4.1 superstructure specifications (released in July 2011) [21] were selected as the basis for this study.

2.1.1 Related Java Knowledge

Relevant Java knowledge needs to be understood, including the Java syntax/specifications and the OO paradigm. The parts of the Java specifications of particular interest, along with several related concepts of the OO paradigm, help to understand how abstraction can be created over the cSC.

2.1.1.1 Java SE7 Specifications

The parts of the Java SE7 specifications of particular interest are described in the following:

Framework is a set of classes and interfaces which provide a ready-made architecture [22].

Collection framework provides the ready-made classes and interfaces needed to represent a group of objects (also called instances) as a single entity in Java [22]. For example, the Map interface with the corresponding classes, e.g., HashMap, is used to represent a group of instances.

2.1.1.2 Object-Oriented (OO) Paradigm

Several related concepts of the OO paradigm are described in the following, according to the sources [20, 23, 24, 25, 26, 27].
Object refers to a class instance or an array in Java, and every object created belongs to a certain class [20, 25]. An object is an encapsulation of information and behavior relative to some entity of the application domain under consideration [25]. In real systems, many objects with similar information (data) and behavior (functionality) can be found [25].

Class captures those objects with similar information (data) and behavior (functionality), and a class can be viewed as an abstract data type [25]. A class is defined as including at least two types of features: attributes (also called variables, fields, or data members), which stand for the stored information, and methods (also called operations or function members), which represent the behavior [25].

Encapsulation is a technique for minimizing interdependencies among separately written modules by defining strict external interfaces [26]. An encapsulated module can only be accessed by clients (that is, other modules that make use of this module) via this interface [27]. Implementation details are “hidden” within the module. The primary reason for requiring encapsulation is to make it possible to change (improve) the implementation of a module without having to change (and/or recompile) the module’s clients [27]. Take Java as an example. To encapsulate the attributes of a class, all attributes of that class should be set to private unless they are specifically declared public [25]. The public setter and getter operations defined for the attributes of a class are called its interface and should only be the “tip of the iceberg”, with the hidden part called the implementation [25]. The interface of a class allows the supplier class to expose the values of those attributes to the client class [25].

Inheritance allows the subclass that extends the superclass to be arranged in a hierarchical structure [24, p.
451], and thereby allows a subclass to take on the general attributes and operations of that superclass in the inheritance chain, so that these attributes and operations form part of the definition of the subclass for code reuse [23, p. 63].

2.1.2 Unified Modeling Language (UML)

UML is a modularly structured language that can provide specific components of primary interest for a specific domain or application [21, p. 1]. UML is a de facto standard formalism for software design and analysis [6]. With existing CASE tools such as Enterprise Architect [28] and IntelliJ IDEA [29], specific constituents of UML can be readily visualized to accommodate specific requirements. Of particular interest are class diagrams, which are used for modeling the information on the domain of interest in terms of objects (instances) organized in classes and the relationships between them [6]. Thus, the constituents of UML most likely to be required for constructing a class diagram are classifiers (classes), the classifiers’ (classes’) embedded textual notation for attributes and operations, and relationships between classifiers (classes). As mentioned above, they can all be visualized with a CASE tool; these constituents are detailed separately in this section.

2.1.2.1 Graphical Notation for Classifiers

Classifier refers to a classification of instances, describing a set of instances that have features in common [21, p. 51], in which the textual notation for attributes and operations is embedded.

Figure 2.1 presents examples of graphical notation for a class Window at three different levels of abstraction: suppressed (top left corner), analysis (right), and implementation (bottom left corner) [21, p. 50]. However, for different systems, the level at which a class is modeled, or how it fluctuates between these levels, depends on the design decisions made by different architects during the design stage.
In some cases, they may decide to model a specific part of the system that is of primary interest at a low level (e.g., the implementation level) to give developers more insight into that part of the system during the implementation stage. In contrast, for a part of the system that is of less interest, they may decide to model that part at a higher level relative to the implementation level (e.g., the suppressed or analysis level).

As seen in Figure 2.1, if the class Window is constructed at the analysis or implementation level, details of its embedded textual notation for attributes and operations are (more or less) laid out. However, the meta-textual notation defined for them in [21] is far more comprehensive than any of the three illustrated in Figure 2.1. Figure 2.1 is taken as an example to give the reader an impression of what a class may look like at various abstraction levels. In general, the layout of a graphical class is composed of three primary sections. These three sections are elaborated in the following with the aid of the example shown on the right of Figure 2.1.

• Upper section (mandatory): Contains the name of the class, e.g., Window [30].
• Middle section (optional): Contains one or more attributes of the class Window, which are used to describe the qualities of Window [30]; e.g., the attribute size: Area describes the size of an instance of the class Window. Notably, this section is only required when describing a specific instance of a class [30].
• Bottom section (optional): Includes the class operations displayed in list format; each operation, e.g., hide(), takes up its own line [30].
The operations describe how a class interacts with data [30]; e.g., for the operation attachX(xWIN: XWindow), the class Window references the class XWindow as a parameter data type, and thereby an interaction between the class XWindow and the class Window arises.

Figure 2.1: Three abstraction levels of graphical class notation: suppressed (top left corner), analysis (right), and implementation (bottom left corner) [21, p. 50]. Compared with the suppressed level, which shows only the name of a class, the other two levels present a more comprehensive layout of a class with (more or less) textual notation for attributes and operations embedded.

2.1.2.2 Conventions of BNF

To standardize the textual notation for attributes and operations embedded in the classifier, legal formats are first specified, i.e., the Backus-Naur Form (BNF) conventions (as depicted in Figure 2.2). The legal formats make the textual notation for attributes and operations easier to interpret. Note that the specification of BNF conventions applies to both the earlier UML v1.0 specifications series and the latest v2.0 specifications series.

Figure 2.2: Specification of Backus-Naur Form (BNF) conventions [21, pp. 16–17].

2.1.2.3 Textual Notation for Attributes

With reference to [21, pp. 129–130], the notation defined for attributes is depicted in Figure 2.3. Note that attribute is a legacy term of the earlier UML v1.0 specifications; attributes are referred to as properties in the UML v2.4.1 superstructure specifications. Property (denoted in Figure 2.3) and attribute are ontologically identical. The term attribute is used in this thesis.

Figure 2.3: Textual notation for attributes [21, pp. 129–130].

This thesis work focuses on five constituents of the attributes’ textual notation: name, prop-type, multiplicity, default, and visibility [21, p. 129] (as depicted in Figure 2.3, underlined in pink).
These five constituents are referred to as name, attribute type, multiplicity, default value, and visibility, respectively, in this thesis. The definitions of these constituents used in this thesis and the reasons why they are of interest are detailed in the following.

Note that although their definitions are depicted in Figure 2.3, some of them still need to be read together with the earlier UML v1.5 specifications [31]. In this way, the two versions of the definitions complement each other, which makes those constituents easier to understand.

• With reference to the definition of name in the UML 1.5 specifications, Chapter 3, Part 5, Section 3.25 “Attribute”, name is an identifier string, usually a simple word, that represents an attribute [31, p. 42].
Names are mandatory in the attributes’ textual notation. Names are not mere identifiers for attributes; in particular, they carry relevant semantics related to the static data structure of the classifier [32]. Semantics preservation is a main objective of the refinement of design into code [32]. Therefore, attribute names are expected to help us understand the mappings between the attributes in the mcAM and the cSC. That is, based on the semantics of the attributes in the mcAM, their corresponding attributes in the cSC that represent similar semantics can be identified.
Note that attributes in the mcAM can correspond to either variables or constant variables in the cSC. Accordingly, an attribute in the mcAM can be modeled as a variable or a constant variable that will be implemented in the cSC. According to the Java naming conventions, variables should be named in camel case [33], and declared class constant variables should be named in all uppercase letters with words separated by underscores (“_”) [34]. These naming conventions are represented in the mcAM in the same way.
Thus, by observing the naming conventions of attributes in the mcAM, one can infer whether an attribute in the mcAM can be mapped to a variable or a constant variable in the cSC.

• With reference to the UML v1.5 specifications, Chapter 3, Part 5, Section 3.25 “Attribute” [31, p. 42], attribute type refers to either the name of a classifier or a language-dependent string that maps to a primitive data type in Java.
There are two reasons for us to study the differences in attribute types. The first reason is that attribute types might be related to relationships between classifiers (and also instances). For example, for a relationship of association, aggregation, or composition to hold between classifiers A and B, classifier A must first reference classifier B as the type of one of its attributes. Then, to determine exactly which relationship holds between classifiers A and B, based on the definitions of relationships in this section, we need to check the detailed implementation in the cSC.
Although the attribute type (name of the classifier) is critical for understanding the relationships between classifiers, the attribute type is optional (as illustrated in Figure 2.3). Hence, the possibility that architects omit attribute types when designing the mcAM cannot be excluded. This is the second reason. The omission of attribute types is a kind of abstraction. Conversely, there are two other cases, caused by developers disagreeing, in the implementation of the cSC, with the abstraction created by architects in the mcAM. These two cases are:
1. The attribute type in the mcAM is removed in the cSC.
2. The attribute type in the mcAM and cSC is specified differently.

• Multiplicity is specified in the textual notation for both attributes and operations. Therefore, its definition and the reason why it is a focus in this thesis are detailed separately in subsubsection 2.1.2.6.
• Default value is an expression that evaluates to the default value or values of the attribute [21, p. 113]. According to the notation for attributes depicted in Figure 2.3, the default value is optional. Thus, it is possible that when designing attributes (variables) in the mcAM, some architects intentionally omit default values and leave it to developers to initialize the variables in the cSC. A possible reason is that these architects take into account the need to cater to new requirements in the future, so that the values of variables will be updated by developers one or more times during the implementation of the cSC. As a result, a default value that is not in the mcAM is added to the cSC.
In contrast to the omission of default values of attributes (variables), some architects may over-specify the default values of attributes (variables) in the mcAM. Developers may disagree with this and choose to remove the specified values in favor of other approaches, such as using public setter operations to initialize the attributes (variables) and update their values one or more times in the cSC. Note that the prerequisite for using public setter operations is that the attributes (variables) declared in the classifier are set to private.
On the other hand, for the design of attributes (constant variables) in the mcAM, considering that a constant variable is assigned a value only once across the life cycle of the program, one question is whether architects intend to specify a default value for a constant variable in the mcAM.
As mentioned previously, there is a further case: differences in attributes caused by the conversion between variables and constant variables between the mcAM and cSC. Thus, another question is whether this case has an impact on the specification of default values in the mcAM and cSC.
Variables are likely to be updated one or more times in the cSC. In particular, assume that variables in the mcAM, which may not hold default values, are converted to constant variables with default values assigned in the cSC. This leads to another case: a default value that is not in the mcAM is added to the cSC. However, differing from the former case, this case is considered to be caused by deviations of the implementation of the cSC from the design of the mcAM.

• Visibility is specified in the textual notation for both attributes and operations. Therefore, its definition and the reason why it is a focus in this thesis are detailed separately in subsubsection 2.1.2.5.

Apart from the five constituents of the attributes’ textual notation described above, the other two constituents, ‘/’ and prop-modifier (attr-modifier), are not considered because no relevant cases occur in the five projects studied.

2.1.2.4 Textual Notation for Operations

With reference to [21, pp. 107–108], the textual notation defined for operations is depicted in Figure 2.4.

Figure 2.4: Textual notation for operations [21, pp. 107–108].

This thesis work focuses on four constituents of the operations’ textual notation: name, parameter-list, return-type, and visibility [21, p. 107] (as shown in Figure 2.4, underlined in pink). They are referred to as name, parameter list, return type, and visibility, respectively, in this thesis. The definitions of these constituents used in this thesis and the reasons why they are of interest are detailed in the following.

Note that although their definitions are depicted in Figure 2.4, some of them still need to be read together with the earlier UML v1.5 specifications [31]. In this way, the two versions of the definitions complement each other, which makes those constituents easier to understand.
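As a minimal sketch of how the four constituents named above (name, parameter list, return type, and visibility) relate to a Java method declaration, the following example renders two methods in a UML-like operation notation via reflection. The classes Window and XWindow follow the example of Figure 2.1; the visibility assigned to each operation and the helper OperationNotation.describe are assumptions made only for illustration and are not taken from the projects studied. Parameter names are not rendered, since plain Java SE7 reflection exposes only parameter types.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

// Classes named after the Window example of Figure 2.1; the bodies are placeholders.
class XWindow { }

class Window {
    // Assumed model notation: + hide()
    public void hide() { }

    // Assumed model notation: - attachX(xWin: XWindow)
    private void attachX(XWindow xWin) { }
}

// Hypothetical helper: renders a Java method in a UML-like operation notation.
class OperationNotation {
    static String describe(Class<?> owner, String name, Class<?>... parameterTypes) {
        Method m;
        try {
            m = owner.getDeclaredMethod(name, parameterTypes);
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException("no such operation: " + name, e);
        }
        // Visibility: map the Java access modifier to the UML symbol.
        int mod = m.getModifiers();
        String visibility = Modifier.isPublic(mod) ? "+"
                : Modifier.isProtected(mod) ? "#"
                : Modifier.isPrivate(mod) ? "-"
                : "~"; // no modifier: package visibility
        StringBuilder sb = new StringBuilder();
        sb.append(visibility).append(' ').append(m.getName()).append('(');
        // Parameter list: only the parameter types are available here.
        Class<?>[] params = m.getParameterTypes();
        for (int i = 0; i < params.length; i++) {
            if (i > 0) sb.append(", ");
            sb.append(params[i].getSimpleName());
        }
        sb.append(')');
        // Return type: the colon and type are left out for void operations.
        if (m.getReturnType() != void.class) {
            sb.append(": ").append(m.getReturnType().getSimpleName());
        }
        return sb.toString();
    }
}
```

For instance, OperationNotation.describe(Window.class, "attachX", XWindow.class) yields "- attachX(XWindow)", which makes the reference from Window to XWindow as a parameter type visible in the rendered notation.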
• With reference to the definition of name in the UML 1.5 specifications, Chapter 3, Part 5, Section 3.26 “Operation”, name is defined as an identifier string that represents an operation [31, p. 44].
Names are mandatory in the operations’ textual notation. Names are not mere identifiers for operations; in particular, they carry relevant semantics related to the behavior of the classifier [32]. Semantics preservation is a main objective of the refinement of design into code [32]. Therefore, operation names are expected to help us understand the mappings between the operations in the mcAM and the cSC. That is, based on the semantics of the operations in the mcAM, we are able to identify their corresponding operations in the cSC that represent similar semantics. Sometimes, the semantics of the operations in the mcAM are preserved in the cSC. However, the implementation of operations in the cSC might deviate from their design in the mcAM. This happens when developers disagree with the design decisions on operations made by architects in the mcAM and make different decisions in the cSC. The characteristics of such deviations in operations between the mcAM and cSC are what this thesis aims to study.

• Parameter list is defined as a list of parameters of the operation [21, p. 108] (as depicted in Figure 2.4).
In Java syntax, a parameter of an operation cannot be specified with a default value. Therefore, the default value is excluded in this thesis work. None of the five projects studied contains relevant cases for the parameter constituents parm-property and direction. Thus, these two constituents are also excluded. Consequently, only three constituents of the parameter (parameter-name, type-expression, and multiplicity) are the focus of this thesis work.
They are referred to as parameter name, parameter type, and multiplicity, respectively, in this thesis.
– With reference to the definition of parameter name in the UML 1.5 specifications, Chapter 3, Part 5, Section 3.26 “Operation”, parameter name is defined as an identifier string that represents a parameter [31, p. 45].
Parameter names are mandatory in the specification of operations. However, compared with the names specified for attributes and operations, parameter names are considered here merely as identifiers for operation parameters. The reason is that the semantics of the operation names already help us to map the operations between the mcAM and cSC.
– Parameter type is defined as an expression that specifies the type of the parameter [21, p. 108] (as depicted in Figure 2.4). The parameter type can be either a primitive data type or a nonprimitive data type that is represented by the name of a classifier.
In the specification of parameters, parameter types deserve more attention than parameter names, since the parameter type(s) of an operation is (are) related to the relationships between classifiers. For example, classifier A references classifier B as a parameter type. This implies a dependency between classifiers A and B (i.e., classifier A depends on classifier B for its implementation). If the parameter type specified in the mcAM is changed or removed in the cSC, this will in turn lead to changes in the relationships between these classifiers (or involving other classifiers). Such differences in relationships between classifiers in the mcAM and cSC are what this thesis aims to study.
– Multiplicity (see subsubsection 2.1.2.6)

• With reference to the UML v1.5 specifications, Chapter 3, Part 5, Section 3.26 “Operation”, return type is defined as a language-dependent specification of the implementation type (i.e., the primitive data type in Java), or types of the value returned by the operation (i.e., the nonprimitive data type in Java) [31, p. 45]. Note that the colon and the return type are omitted if the operation does not return a value (as for Java void) [31, p. 45]. Therefore, if no return type is specified for an operation in the mcAM, yet void is shown as the return type in the cSC, we do not regard this as a difference in return types between the mcAM and cSC.
The reason for us to study the differences in return types is that they might lead to changes in relationships between classifiers in the mcAM and cSC. For example, classifier A references classifier B as a return type. This implies a dependency between classifiers A and B (i.e., classifier A depends on classifier B). If the return type B is changed or replaced by void, this will in turn lead to changes in the relationship between these classifiers (or involving other classifiers). Such differences in relationships between classifiers in the mcAM and cSC are what this thesis aims to study.

• Visibility (see subsubsection 2.1.2.5)

Apart from the four constituents of the operations’ textual notation described above, the remaining constituent, oper-property, is not the focus of this thesis, since there is no relevant case in the five projects studied.

2.1.2.5 Visibility

With reference to the UML 1.5 specifications, Chapter 2, Part 2, Section 2.5 “Core” [31, pp. 35–36], the definitions of feature and visibility are given in the following:

Feature is defined as an attribute or operation that is encapsulated within a classifier [31, p. 35].
Visibility is defined as specifying whether a Feature can be seen and referenced by other classifiers [31, p. 36]. The four types of visibility and their notations are as follows:

• Public (denoted by the symbol ‘+’): Any outside classifier with visibility to classifier A can use the Feature of classifier A [31, p. 36].
• Protected (denoted by the symbol ‘#’): Any descendant of classifier A can use the Feature of classifier A [31, p. 36].
• Private (denoted by the symbol ‘−’): Only classifier A itself, or a classifier B nested within classifier A, can use the Feature of classifier A [31, p. 36].
• Package (denoted by the symbol ‘∼’): Any classifier declared in the same package (or a nested subpackage, to any level) as the owner of the Feature can use the Feature [31, p. 36].

Osman et al. [35] reported that software engineers prefer to leave out private and protected operations to simplify a class diagram. However, this has not been validated in a case, and to what extent they are left out is unknown. This thesis aims to fill this gap. Considering the encapsulation of attributes in the OO paradigm, one question is whether, to simplify a class diagram, the public visibility of the setter and getter operations of private attributes is preferably omitted, or whether both the public visibility of those operations and the private visibility of the encapsulated attributes are omitted.

In another source [36], Osman et al. found that “counting the number of public operations” is the most important metric for indicating the importance of a class. That work analyzes, from the reverse direction, how to recover an important class for a reverse-engineered class diagram. From the forward direction, for (both important and less important) classes, the question is whether the omission of the public visibility of operations relates to the type of operation (based on the three types of operations defined in this thesis: constructors, setter or getter operations, and other operations, as detailed in section 4.3). These questions motivate this thesis to study the visibility of both attributes and operations.

2.1.2.6 Multiplicity

Multiplicity is defined as an inclusive interval of non-negative integers beginning with a lower bound and ending with a (possibly infinite) upper bound [21, p. 95]. The textual notation for multiplicity specified in BNF is depicted in Figure 2.5. Only the multiplicity range is the focus, since the other constituents of multiplicity are absent in the five mcAMs studied. Considering the Java specifications, a multiplicity in the mcAM can be realized as either an array or a collection, in which case the multiplicity range should be [0, +∞] [7].

Multiplicity is specified in attributes and operations for three constituents that are related to data types: attribute type, parameter type, and return type. There are two reasons for focusing on the multiplicity of these three constituents. The first reason is that there might be differences in the collection interface types used, e.g., Set in the mcAM changed into List in the cSC. The second reason is that a multiplicity not in the mcAM might be added to the cSC, e.g., the name of a classifier in the mcAM is changed into a List in the cSC.

Multiplicity is a property of these three constituents and represents the number of instances of the classifier, so multiplicity should be considered together with these constituents.
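The two cases above can be sketched in Java as follows. This is a minimal illustration under stated assumptions: the classes Book, ShelfAsModeled, and ShelfAsImplemented and all of their members are hypothetical and do not come from the five projects studied. The first class stands for a model-side attribute drawn with a Set; the second stands for a code-side implementation in which the collection interface was changed to List and an array-typed attribute carries a multiplicity that the model did not show.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical classifier used only for illustration.
class Book { }

// Assumed model side: attribute "books" of type Book with a multiplicity, drawn as a Set.
class ShelfAsModeled {
    private final Set<Book> books = new HashSet<Book>();

    void add(Book book) { books.add(book); }

    int count() { return books.size(); }
}

// Assumed code side: the collection interface differs (List instead of Set),
// and an array-typed attribute adds a multiplicity the model did not show.
class ShelfAsImplemented {
    private final List<Book> books = new ArrayList<Book>();
    private final Book[] featured = new Book[0];

    void add(Book book) { books.add(book); }

    int count() { return books.size(); }

    int featuredCount() { return featured.length; }
}
```

Such a change is not merely cosmetic: adding the same Book instance twice is recorded once by the Set-based class but twice by the List-based class, so the interface type has to be compared together with the multiplicity when mapping the two sides.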
On the other hand, multiplicity is correlated with the relationships of aggregation and composition (see their definitions and semantics in subsubsection 2.1.2.7). Note that the multiplicity placed at the end of an association is not the focus of this thesis, as it is not easy to manually check the number of invocation sites of the instances in the cSC.

Figure 2.5: Syntax for a multiplicity string [21, p. 98].

2.1.2.7 Graphical Notation for Relationships

This thesis studies seven types of relationships in the mcAM: dependencies, usages, associations, aggregations, compositions, inheritances (also known as generalizations), and realizations. With reference to the UML v1.5 specifications, Chapter 3, Part 5 [31, pp. 34–93], and the UML v2.4.1 specifications [21], their definitions are given as follows (note that because clear definitions of some relationships are unavailable, their definitions are supplemented with their semantics to make them easier to understand):

Dependency is defined as a relationship that relates the model elements (constituents) themselves and does not require a set of instances for its meaning [31, p. 90]. A dependency signifies that a class requires another class for its specification or implementation [21, p. 61], so a dependency indicates a situation in which a change to the target element may require a change to the source element of the dependency [31, p. 90].

Usage is defined as a relationship where one class requires another class for its full implementation or operation [21, p. 139]. Note that in the metamodel, a Usage is a Dependency in which the client requires the presence of the supplier.

A binary association refers to an association among exactly two classes (including the possibility of an association from a class to itself) [31, p. 68].

Aggregation refers to a whole/part type of binary association relationship.
Composition refers to a strong form of aggregation that requires a part instance to be included in at most one composite at a time [21, p. 38]. If a composite is deleted, all of its parts are normally deleted with it [21, p. 38].

Inheritance refers to a taxonomic relationship between a more general superclass and a more specific subclass [21, p. 38]. Each instance of the subclass is also an indirect instance of the superclass [21, p. 70]. Thus, the subclass inherits the features of the superclass [21, p. 38].

Realization refers to a relationship between a class and an interface implying that the class supports the set of features owned by the interface and any of its parent interfaces [21, p. 89].

The graphical notation for the seven types of relationships defined above is depicted in Figure 2.6. Of particular note, according to the definitions of the relationships of dependency, association, aggregation, and composition, four corresponding levels (from 1 to 4) of abstraction are defined, as depicted in Figure 2.7. The higher the level of abstraction, the lower the level number.

Figure 2.6: Graphical notation for seven types of relationships.

Figure 2.7: Four abstraction levels specified for four relationships: dependency, association, aggregation, and composition, respectively.

2.2 Related Work

This section consists of two parts. The first part concerns existing reverse engineering methods/tools. The second part describes the usage of models in software development practice.

2.2.1 Reverse Engineering

Müller et al. [37] refined the definition of reverse engineering in [38] as “a process of analyzing a subject system to identify its current components and their interrelationships and to extract and create system abstractions and design information.” Reverse engineering methods/tools play a key role for legacy systems that lack a design.
In particular, class diagrams are often poorly updated during development and maintenance [39, 13]. There is a concern that such a divergent class diagram cannot help software maintainers understand the system’s architecture in the same way during later maintenance. As a solution, reverse engineering methods/tools can be used to automatically generate reverse-engineered class diagrams from the current code. Such a reverse-engineered class diagram represents the up-to-date system architecture.

2.2.1.1 Manual Abstraction Created over Code

The class diagrams produced during the design and implementation phases of the SDLC can be referenced by software maintainers during the maintenance phase to understand the system’s architecture. However, in some cases, class diagrams may contain large volumes of information [35]. This makes it hard for software maintainers to understand the system’s architecture [35]. It is therefore essential to understand how software engineers manually create abstraction and thereby condense/simplify class diagrams. For this purpose, Osman et al. [35] conducted a survey to investigate how manual abstraction is created over code. The survey involved 32 software developers, with 75% of the participants having more than 5 years of experience with class diagrams [35]. They found that the important elements in a class diagram are class relationships, meaningful class names, and class properties [35]. Also, the information that should be excluded from a simplified class diagram is GUI-related information, private and protected operations, helper classes, and library classes [35]. However, these findings need to be validated in a case. The five case studies employed in this thesis can help validate these findings to some extent.
2.2.1.2 Solutions and Attempts to Provide Abstraction for Reverse-Engineered Class Diagrams

Making reverse-engineered diagrams both abstract and precise is a primary goal of the reverse engineering community. Concepts related to diagram condensation/simplification or diagram abstractness were proposed first, and several technologies aiming at achieving this goal were developed later on.

Guéhéneuc [5] stated that reverse engineering tools that are both abstract and precise do not yet exist on the market. Guéhéneuc [7] started by developing a tool named PTIDEJ that aims to produce precise reverse-engineered class diagrams, in particular to infer use, association, aggregation, and composition relationships, motivated by the lack of clear definitions of those relationships. PTIDEJ [7] produces class diagrams that are even more accurate than those manually created by humans. Furthermore, Guéhéneuc [5] argued that the lack of abstraction in existing reverse engineering tools is due to the lack of clear definitions of class diagrams’ constituents. Therefore, Guéhéneuc systematically studied the constituents of class diagrams with reference to the UML 1.5 specifications and refined their definitions. Guéhéneuc [5] then exemplified the study with PTIDEJ to reverse engineer Java programs into UML diagrams abstractly and precisely. We agree with the assertion in [5] that the definitions of some constituents of class diagrams are vague and verbose. Thus, the refinements of the definitions in [5, 40] by Guéhéneuc greatly helped this thesis work with regard to the mappings between mcAM constituents and cSC constructs.

The lack of abstraction in reverse engineering tools/methods has been noted by other authors as well [35, 41, 39]. The resultant diagram generated by reverse engineering methods/tools is often very cluttered [41].
This is of little help to software engineers in understanding the system’s architecture, since it is hard for them to locate the key places of primary interest.

Regarding the condensation of reverse-engineered class diagrams to enhance their comprehensibility, Osman et al. [42] proposed an approach that uses a supervised classification algorithm with design metrics (e.g., number of operations, number of attributes, etc.) as the input. Yet, a fundamental question left open is which elements of the system’s architecture should be selected to accommodate various levels of abstraction [39]. An extension of Osman et al.’s work was conducted by Thung et al. They