Creating a Bi-Directional Source-to-Source Compiler Using MDE Transformation Techniques

A Proof-of-Concept Using the General-Purpose Programming Languages COBOL and C++

Sebastian Blomberg and Joel Severin

Master's Thesis 2017
Department of Computer Science and Engineering
Division of Software Engineering
Chalmers University of Technology
University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000
Gothenburg, Sweden 2017

© Sebastian Blomberg and Joel Severin, 2017.

Supervisor: Regina Hebig, Department of Computer Science and Engineering
Examiner: Jan-Philipp Steghöfer, Department of Computer Science and Engineering

Abstract

In this thesis, we discuss how a bidirectional source-to-source compiler, for the general-purpose languages COBOL and C++, can be implemented using Model-Driven Engineering (MDE) tools and practices. The outcome of our work is primarily an approach for implementing said bidirectional compiler using formal grammars, bidirectional transformation languages, and a developed concept model. This approach also illustrates how a programmer's intent can be transferred between languages. In order to evaluate the approach, a prototype was realized using Ecore, Xtext, and Medini QVT. In the process, a library for emulation of COBOL data types in C++ was also developed. Finally, we assess the success of the developed approach and prototype.

Keywords: source-to-source compiler, transpiler, translator, bidirectional transformation, model-driven engineering, COBOL, data type emulation.

Acknowledgements

We would like to thank our supervisor Regina Hebig for steering us in the right direction every time we lost track or had questions. We would also like to thank everyone else who helped us improve our work, including our examiner Jan-Philipp Steghöfer, and our opponents Peter Eliasson and Jakob Csörgei Gustavsson. Kindly, thank you.

Joel Severin and Sebastian Blomberg, Gothenburg, June 2017

Contents

1 Introduction
  1.1 Motivation in Relation to Industry
  1.2 Research Questions
  1.3 Purpose
  1.4 Delimitations and Limitations
  1.5 Disposition
2 Theory
  2.1 Formal Grammars
  2.2 Metamodeling
  2.3 Transformations
  2.4 Bidirectional Transformation Languages
    2.4.1 Triple Graph Grammars
    2.4.2 QVT Relational
    2.4.3 Text-based approaches
3 Related Work
  3.1 Bridging Grammars and Metamodels
    3.1.1 Xtext
  3.2 Translation of General-Purpose Languages
    3.2.1 Early Attempts at Bidirectional Translation
    3.2.2 The Idea of Concepts Common to Code between Languages
    3.2.3 Formalizing Bidirectional Translation
4 Method
  4.1 Survey of COBOL Language Construct Frequency
  4.2 Choosing a Target General-Purpose Language
  4.3 Emulating COBOL Data Types in C++
  4.4 Creating a Model-Driven Source-to-Source Compiler
    4.4.1 Creation of an Xtext COBOL Grammar
    4.4.2 Creation of an Intermediate Model
    4.4.3 Specifying Transformations in QVT-R using Echo
    4.4.4 Specifying Transformations in Medini QVT
    4.4.5 Creation of an Xtext C++ Grammar
    4.4.6 Evaluating Results
5 Survey of COBOL Language Construct Frequency
  5.1 The Developed Analysis Tool
  5.2 Analysis Results
  5.3 Limitations and Validity of Generalization
6 Choosing a Target General-Purpose Language
  6.1 The Comparison
    6.1.1 Static or Dynamic Typing
    6.1.2 Memory Management and Environment
    6.1.3 Primitive Types
    6.1.4 Classes and Objects
    6.1.5 Functions
    6.1.6 Basic Syntax
    6.1.7 Pre-Processing and Meta-Programming
  6.2 Discussion
7 Emulating COBOL Data Types in C++
  7.1 Enhanced Byte Arrays
  7.2 The Default Decimal Type
  7.3 Packed Decimal
  7.4 Strings
  7.5 Summary
8 Creating a Model-Driven Source-to-Source Compiler
  8.1 Transformations between Code and Model
  8.2 Information in Code
    8.2.1 Concepts
  8.3 The Intermediate Concept Model
    8.3.1 Program
    8.3.2 Variables
    8.3.3 Arithmetic Expression-Assignments
    8.3.4 Conditional Branching
    8.3.5 Loops
    8.3.6 Printing
  8.4 Transformation between Concept Model and Language Models
    8.4.1 Basic Relations
    8.4.2 Delegated Relations
    8.4.3 Enforcing Order
    8.4.4 Dealing With Strings in Medini QVT
    8.4.5 Implementing Alternate Relations
    8.4.6 Conditional Relations
  8.5 Evaluation
    8.5.1 Correctness
    8.5.2 Intent Preservation
    8.5.3 Construct Preservation
  8.6 Discussion
9 Conclusion
  9.1 Limitations
  9.2 Future Work
    9.2.1 On the Language Construct Analysis
    9.2.2 On Emulating COBOL Data Types
    9.2.3 On the Source-to-Source Compiler
      9.2.3.1 Choosing a Good Solution
      9.2.3.2 Eliminating the Intermediate Model
      9.2.3.3 Translation within a Language
      9.2.3.4 Usages beyond Translation
Bibliography

1 Introduction

A natural part of the technological evolution is the eventual decrease in popularity, and usage, of old technologies. In the field of Software Engineering, one such technology is the general-purpose programming language COBOL (common business-oriented language). Its replacements include more popular languages, such as Java, C, C++, C#, Ruby, Python, and recently even JavaScript. Our work aims to ease the transition between COBOL and a replacement language, by transforming old COBOL code to a new language, and then automatically propagating changes made in the new code back to the old COBOL code base. This advancement allows developers with knowledge of only one of the languages to work together, as development in both languages can co-exist. Hopefully, important system and domain knowledge can be transferred between programmers of different generations in the process.

1.1 Motivation in Relation to Industry

Conducting research about transforming COBOL code into a more modern language may be motivated by current circumstances in the industry, including:

• there has been a vast decrease in education of new COBOL programmers [1],
• existing COBOL programmers are nearing retirement age [2],
• a majority of IT managers raise concerns about a current or future shortage of COBOL-knowledgeable developers [2],
• COBOL is not a popular language anymore [3, 4],
• COBOL is still widely used, especially in the financial sector [5].¹

¹ Datamonitor [5] estimates that there were 200 billion lines of COBOL code in use in 2009, and 5 billion new lines being added each year. Additionally, they estimated that 90% of the world's financial transactions were processed in COBOL.

There are examples of products which transform COBOL into C [6] and COBOL into Java [7].
However, these tools only support migration of COBOL code, meaning that the transformation is unidirectional, and once the transformation is done, code is only written in the new language. As a consequence, the underlying system (often a mainframe) may need to be replaced. This presents a challenge for several reasons. First, it may be costly [2]. Second, it may present uncertainty, as there sometimes seems to be a feeling that the old systems, as quoted by a COBOL application manager, "work pretty well" [2]. Third, when existing COBOL developers retire, knowledge is lost. Finally, there might be serious performance issues when migrating to a new system [8]. By utilizing bidirectional translation of code, a COBOL system can remain intact and continue to be developed in C++, thus mitigating some of the factors described above.

Replacing the COBOL language might, however, not be a solution to all problems associated with legacy COBOL systems. While COBOL presents a language barrier, the systems themselves might have a complex design, and the COBOL-knowledgeable workforce not only knows COBOL, but also the domains of these systems.

1.2 Research Questions

The following research questions will guide the thesis work:

RQ1 Which are the most common COBOL constructs used in practice?

RQ2 Which well-known language would be appropriate to use as target, in a transformation from COBOL?

RQ3 How can a bidirectional transformation be applied in order to transform between COBOL (source) and the well-known language (target), such that:
  (a) the resulting code is correct according to the target syntax,
  (b) at run-time, both programs behave the same in terms of output, given the same input,
  (c) changes in the target code base propagate back to the source code base, such that as much of the COBOL code as possible is left intact, with focus on locality of change,
  (d) and such that language constructs in the source language are translated in a manner such that intent, to the greatest extent, is preserved in the corresponding target language constructs?

where we define intent in Definition 1.2.1. RQ1 and RQ2 need to be answered in order to 1) determine the focus and limitations of our study, and 2) design the actual transformations in the main research question RQ3.

Definition 1.2.1 (Intent). The implicit encoded reason why a piece of code is expressed in one way, when there are multiple ways to achieve the same goal.

1.3 Purpose

The purpose of our work is to provide an approach for implementing a bidirectional source-to-source compiler using modern transformation techniques, including (but not limited to) techniques from the field of Model-Driven Engineering (MDE). Furthermore, a prototype implementation will be presented as a proof of concept for the provided approach.

1.4 Delimitations and Limitations

One part of the outcome of the work we report on is a research prototype. As such, it is neither complete nor fully tested. However, it serves as a proof-of-concept that can be extended and generalized as the field of bidirectional language translation evolves. The prototype does not take code style into account. Instead, focus is put on the actual content of the code and approaches to translate it.
Furthermore, although the approaches presented in this thesis are quite general, we make no claim of generality for programming languages dissimilar to the ones supported in the prototype.

A comment on the usefulness of our work should be made. It could be that learning COBOL is a relatively small task, compared to understanding the complex systems written in COBOL. It could be theorized that most of the COBOL systems still in existence today are the most complex ones, since the simpler ones have already been replaced by modern technologies. Our work will not solve the issue of complex system knowledge, but it could still be useful for a human wanting to acquire such knowledge.

1.5 Disposition

The remainder of this thesis follows the structure of first presenting background theory, and then related work of direct relevance to our contribution. After that, our methods of conducting work are presented. Next, our contribution is reported on and discussed in chapters 5 to 8. Finally, a conclusion and some ideas for future research are presented. In the part concerning our contribution, we iteratively report on how we solved several problems across several chapters (this constitutes our results), enabling us to reach our goal of answering all research questions.

2 Theory

This chapter introduces the foundations on which the related work (next chapter) of direct relation to our contribution, and our contribution itself, are built. Since our work covers multiple sub-fields of software engineering, a short summary with motivations follows to orient the reader. First, formal grammars are described, which can be used to parse code in text form and generate a model from it. As a specification, formal grammars also allow text to be generated from the grammar rules they encode, possibly from an existing model. Second, the notion of a model is formalized, with the introduction of metamodels. The mappings between text and model are essential to our contribution and are further elaborated on in the Related Work chapter. Finally, transformations, especially those moving information between different models (as opposed to text), are described. The important property of a transformation being bidirectional (similar to reversible) is explained in depth.

2.1 Formal Grammars

A formal grammar (henceforth referred to simply as a grammar) contains a set of production rules, linking a list of nonterminal and terminal symbols to another list of such symbols [9]. The terminal symbols represent the elementary symbols of a language, while the nonterminals contain a list of terminals and other nonterminals. Also, the grammar contains a start rule, imposing some ordering on the application of the production rules. Given a specification of a grammar, a certain language can be parsed by a machine. Certain constraints can be applied to the grammar, restricting which types of languages it accepts. Chomsky [9] proposed classifying grammars using a hierarchy of constraints, limiting the languages they accept. The hierarchy orders four different classes of grammars, presented from general at the top, to less general at the bottom, in Figure 2.1. A certain grammar class is able to accept all languages that the less general grammar classes accept, and more. At the bottom of the hierarchy, regular languages are found, e.g. those parseable by a regular expression. Regular languages allow linking one nonterminal to exactly one terminal, optionally either preceded or followed by a nonterminal.
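As a concrete sketch (our own illustration, not an example from the thesis), a right-linear regular grammar generating the strings b, ab, aab, ... consists of rules of exactly these shapes:

    % Illustrative regular (right-linear) grammar for the language a*b.
    % Each rule links one nonterminal to exactly one terminal,
    % optionally followed by a nonterminal.
    \begin{align*}
      A &\rightarrow a\,A && \text{(one terminal, followed by a nonterminal)}\\
      A &\rightarrow b    && \text{(exactly one terminal)}
    \end{align*}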
One level up, the context-free languages are found. These allow linking one nonterminal to a list of terminals and nonterminals. At the next level, context-sensitive grammars reside, allowing a context of terminals and nonterminals around both sides of a rule (the context has to be the same on both sides of the rule). Finally, the most general class contains the recursively enumerable grammars, allowing any terminals and nonterminals on both sides of a rule, thus removing the restriction that the context must be the same, as in context-sensitive grammars.

Backus [10] proposed to standardize specification of production rules linking nonterminals to a list of terminals and nonterminals, allowing context-free grammars to be specified by a textual notation now known as the Backus-Naur Form (BNF). Grammars for general-purpose languages are typically expressed in the de-facto industry standard Extended BNF (EBNF) syntax, now standardized as ISO/IEC 14977 [11]. EBNF is an extension of BNF that adds syntactic sugar allowing easier specification of repetition (e.g. the Kleene star *, as also found in regular expressions).

Figure 2.1: Chomsky's hierarchy of grammars, ordered by the languages their respective class accepts. Anything is a terminal or nonterminal. LeftContext and RightContext are terminals or nonterminals, but need to be exactly the same on both sides of the arrow (to define a context).

In order to facilitate parsing of context-free languages by algorithms run on today's computers, Knuth [12] proposed that a suitable subset would be LR grammars, denoting grammars that can parse an input stream by reading it left to right once (without backing up to reconsider previous choices). The algorithms would then exhibit linear complexity with respect to the input length. More specifically, an LR(k) grammar is allowed to have k tokens of lookahead, meaning that the parser can peek at no more than k upcoming lexical tokens in the input stream. LR parsers have later become known for their other properties: they produce a rightmost derivation, resolving production rules from right to left, and thus produce a bottom-up parse (where the smallest elements of a language are recognized first) [13].

Some additional constraints imposed on the special case LR(1), also known as Canonical LR (CLR), have allowed the less resource-intensive Look-Ahead LR (LALR) and Simple LR (SLR) algorithms to be developed [13]. In fact, these theoretically cover all countable k, as it can be proven that for every LR(k) grammar, there exists an equivalent LR(1) grammar. Additionally, the Generalized LR (GLR) parser builds upon LR parsing, making it less restrictive (but with the same complexity) [14]. These specializations of the LR parser have become popular choices for implementation in compilers, and consequently even language specifications are written with respect to them. For example, the popular parser generator yacc, and its successor Bison, are based on these algorithms [15].

In contrast to LR parsing, a different family of algorithms, which have become known as LL parsers, has been discovered [13]. These still read the input stream from left to right, but perform a leftmost derivation, resolving rules from left to right, thus producing a top-down parse (where the largest elements of a language are recognized first).
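To illustrate the difference in derivation order, consider the toy grammar S → A B, A → a, B → b and the input ab (our sketch, not an example from the thesis); an LR parser produces the rightmost derivation in reverse:

    % Leftmost (LL, top-down) vs. rightmost (LR, bottom-up) derivation
    % of "ab" for the toy grammar S -> A B, A -> a, B -> b.
    \begin{align*}
      \text{leftmost:}  && S \Rightarrow A\,B \Rightarrow a\,B \Rightarrow a\,b\\
      \text{rightmost:} && S \Rightarrow A\,B \Rightarrow A\,b \Rightarrow a\,b
    \end{align*}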
For example, the popular parser generator ANTLR uses LL(*) parsing (from version 4, ALL(*) parsing), where the star denotes that k may be infinite [16].

For completeness, we mention that parsers are often table-based, meaning that they are built using pre-computed decision tables for which symbol is read, and what is in the look-ahead [13]. There also exist recursive-descent LL parsers, and recursive-ascent LR parsers, where the decisions are taken by actual program code, written natively in the programming language together with the parser. Lexers and parsers are typically generated by lexer and parser generators, i.e. tools which, based on a grammar, are able to automatically generate actual parser programs in native source code (C, Java, etc.).

Code of a certain programming language is typically read by a compiler using a two-step process of first applying lexical analysis to the input text stream using a regular language, then parsing the resulting tokens with a restricted context-free grammar (e.g. with a LALR parser) [17]. The lexer tokens provide a symbol representation of the underlying text stream, consisting of a name and an optional attribute (for example (INT, 42), when lexing the integer 42). The token names are then interpreted as terminals for the next stage, where the parser uses the context-free grammar to produce an abstract syntax tree (AST), for further use by the compiler.

2.2 Metamodeling

The code-centric development approach, i.e. defining programs by writing code, is the prevalent and traditional way of developing software. One explanation for why this is the case might be that the majority of all general-purpose programming languages are defined by grammars, meaning that somewhere along the way code must be written to produce software.

Another approach to describe and develop software is using models. In the field of software engineering, Seidewitz [18] defines a model as "a set of statements about some system under study (SUS)", where a statement is "some expression about the SUS that can be considered true or false". This definition is rather general, and for a model to be meaningful there needs to exist an interpretation. Seidewitz [18] defines an interpretation as a "mapping of the model's elements to elements of the SUS such that we can determine the truth value of statements in the model from the SUS". For example, suppose we have an Entity-Relationship (ER) model. The model contains expressions of which data to be stored and relationships between collections of these data expressions (entities). If we say that our SUS is a relational database, the interpretation of the model is that an entity is mapped to a table, an attribute to a column, a relationship to a foreign key constraint, and so on. In this case, the statements of the model are true if their mapped counterparts in the SUS exist and conform to the model. If all statements in a model are true, Seidewitz [18] defines the model as being correct.

A model can be expressed through the use of a modeling language. For example, in order to express an ER model, there needs to be a definition of how and what it can contain (e.g. entities, attributes, etc.), as well as how the different components relate; this is defined in a modeling language. The Unified Modeling Language (UML), containing language definitions for a wide range of applications, such as Class diagrams (models) and State machines, is the predominant modeling language for software.
A modeling language is itself defined by another model, referred to as a metamodel. A metamodel expresses statements which any model created in a modeling language needs to adhere to. As such, the interpretation of a metamodel is a mapping between its elements and the elements of the modeling language, and therefore a metamodel determines the validity of a model created in a modeling language [18].

An analogy can be made between a metamodel and a grammar. A grammar is a textual specification (a set of rules) of how a piece of code (text) may be written to be valid. Similarly, a metamodel is a specification of how a model, created in a certain modeling language (a set of rules), may be created to be valid.

Since a metamodel is actually a model, it can be expressed in a modeling language, implying that a metamodel can be defined by a meta-metamodel. In turn, that meta-metamodel can be defined by a meta-meta-metamodel. Theoretically, there is no limit to how many metamodels can exist in a chain of metamodels. However, at some point they are bound to be superfluous, specifying trivialities. In such a case, a metamodel can be used to describe itself. Within model-driven software engineering, there is a four-layer architecture traditionally used when referring to layers of metamodels, as depicted in Figure 2.2 [19]. At the top layer M3, there is a general (meta)model specified in the Meta-Object Facility (MOF) standard [20]. This M3 model is self-contained, meaning that it can be used to describe itself. An M3 model can also be used to describe an M2 model. The most commonly used M2 model is the UML metamodel, although the MOF standard does not limit the choice to UML. Layer M1 classifies user-defined models, conforming to the metamodel in layer M2. For example, a layer M1 user-defined UML model could be a model for modeling users in a system. Continuing the example, the corresponding layer M0 model would be the instance of the M1 (meta)model, i.e. the data which describes the actual user. See the right-hand side of Figure 2.2 for an illustrative example.

Figure 2.2: An illustration of the four-layer architecture. The left-hand side of the figure illustrates the relations between the models of the four layers, whereas the right-hand side provides an example of usage.

2.3 Transformations

Transformation is a mechanism for establishing consistency between, or creating new, artifacts by means of a relation, where an artifact might be, for example, a text, a model, or any other kind of formally defined information. Transformations are used in a variety of areas including Software Engineering, Programming Languages, and Databases. A simplistic definition of a unidirectional transformation is r : S → T, where r is the function which specifies the relation between S, the set of source artifacts, and T, the set of target artifacts [21]. By applying the transformation, a target t that complies with the source s, according to the relation r, can be produced. Consistency between a source and a target is considered true if and only if t = r(s).

Note that in grammar-related literature, such as [22], a transformation may be referred to as a translation when it is performed on text. In general, it is intuitive to think of a translation as a transformation, since a translation transforms text, although context-dependent meanings might apply in some cases.
We will use translation and transformation interchangeably when referring to the end-to-end process of translating code; note that a translation may consist of several transformations that are not text-based, as long as the outcome is text.

Apart from unidirectional transformations, there are also bidirectional transformations, where a relation between two (or more) artifacts applies in both directions. Borrowing from Stevens [21], a bidirectional transformation can be defined by a statement of consistency R(s, t), which is deemed true if the source s and the target t are consistent with regard to the relations R. For each relation in R, there are two directional transformations, one in each direction: the forward transformation →R : S × T → T and the backward transformation ←R : S × T → S; that is, given a pair of models s and t, the transformation should be able to restore consistency in either direction according to the relations R. It should be noted that in a truly bidirectional transformation, there is no distinction between which artifact is the source and which is the target. However, in literature, the terms are still used when describing bidirectional transformations.

Bidirectional transformations are inherently more complex than unidirectional transformations. One of the main issues faced with bidirectional transformations is dealing with non-trivial types of relations. A relation between two artifacts in S and T, respectively, is bijective if for every artifact s ∈ S there exists exactly one unique artifact t ∈ T, i.e. there is a one-to-one mapping between artifacts (see Figure 2.3a). Bijective relations are easy to handle, in the sense that there is always an obvious way to calculate the inverse of a bijective transformation, i.e. the transformation in the other direction. However, a bijective constraint may be too restrictive in many cases [21]. Surjective relations, on the other hand, allow for more flexibility. A relation is surjective if for every s ∈ S there exists an artifact t ∈ T (not necessarily unique), i.e. there exist many-to-one mappings (see Figure 2.3b). However, there is no definite way of calculating the inverse of a surjective transformation, since the inverse of a many-to-one transformation is one-to-many, where any of the options is valid. We refer to this as the general case, see Figure 2.3c. Henceforth, we refer to relations of a surjective or general-case nature as non-bijective relations.

Figure 2.3: An illustration of the types of relations in transformations. (a) shows a bijective relation, which is invertible. (b) shows a surjective relation S → T, which turns into the general case (c) in the opposite direction T → S.

Although applicable in more places, non-bijective relations present problems when restoring consistency: for example, which of the alternatives in a one-to-many mapping should be used? This is a matter of resolving ambiguities. In some cases it might not matter, since all are valid solutions, but in other cases it does. Consider, for example, if s2 has been transformed into t2 in the situation in Figure 2.3b. There are then two possible ways to transform t2 into an artifact in S, namely either s2 or s3, as illustrated in Figure 2.3c. In some cases, e.g. if no changes were made to t2, it might be important that s2 is chosen, since s2 was the original artifact before transformation.
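This situation can be restated formally (a worked micro-example of our own, with hypothetical artifact sets, mirroring Figure 2.3):

    % Our worked restatement of the ambiguity in Figure 2.3: two sources
    % relate to the same target, so only the backward direction is ambiguous.
    \begin{align*}
      S &= \{s_2, s_3\}, \quad T = \{t_2\}, \quad
          R = \{(s_2, t_2),\, (s_3, t_2)\}\\
      \overrightarrow{R}(s_2, t) &= t_2, \quad \overrightarrow{R}(s_3, t) = t_2
          && \text{(forward: unambiguous)}\\
      \overleftarrow{R}(s, t_2) &\in \{s_2, s_3\}
          && \text{(backward: ambiguous)}
    \end{align*}

A tool that keeps the old source as input can resolve the ambiguity by choosing the artifact closest to the original (here s2), whereas a tool that regenerates artifacts from scratch must pick an alternative by some other policy; this distinction reappears below as model synchronization.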
2.4 Bidirectional Transformation Languages

There are several approaches towards implementing a bidirectional transformation. Perhaps the most natural method that comes to mind is to use a pair of unidirectional transformations, such that r1 : S → T and r2 : T → S, forming a relation r. The problem with this approach is that consistency is not guaranteed. For example, consider r1 to be a valid transformation. There is no guarantee that r2 is compliant with r1 and produces the intended result according to the relation r, since it is specified separately from r1. Poskitt et al. [23] propose a formal way of addressing this issue by automatically translating a pair of unidirectional translations into graph transformations, i.e. states of computations presented as graphs, and nested conditions, in order to apply graph transformation validation techniques to ensure consistency. However, their work is not yet available as a tool for public use.

Another approach for implementing bidirectional transformations is using a transformation language which supports bidirectional transformations. Such languages are able to express bidirectional transformations as one unit, i.e. they are able to express the forward and the backward transformation simultaneously. The benefit of this is that transformation validity is ensured by construction [24]. There is a variety of different bidirectional transformation languages and tools operating on different artifacts (e.g. models, text). Their methods of transformation also vary, which may give substantially different results between implementations. The remainder of this section is used to describe the most common of these transformation methods, as well as a few bidirectional languages which use them. In order to compare bidirectional transformation languages, we differentiate between two types: those which are able to support model synchronization and those which are only able to generate new models. Supporting model synchronization entails that information from an old version of a model (artifact) can be used to recreate, i.e. synchronize, a new version, meaning that potentially less information is lost.

2.4.1 Triple Graph Grammars

One of the common methods for implementing bidirectional languages is Triple Graph Grammars (TGGs), as introduced by Schürr in 1994 [25]. TGGs use graph grammars, whereby the artifacts to be transformed (source and target) are interpreted as graphs, along with an intermediate correspondence graph which links the source and the target (i.e. defines the relations). Through the use of these graphs, a language implementing TGGs can then generate the forward and backward transformations, as well as check consistency.

Three bidirectional languages, all based on TGGs and operating on EMF-based models (as artifacts), are eMoflon [26], MoTE [27], and TGG Interpreter [28]. Their specification methods are similar and follow a general TGG rule format. Furthermore, they all support model synchronization. However, their implementations differ. MoTE requires all TGG specifications to be conflict-free [29], meaning that all relations are required to be bijective. TGG Interpreter, on the other hand, does not require specifications to be conflict-free, nor does it guarantee that a valid (or complete) model will be produced, as it does not implement any lookahead or backtracking; as such, if one wrong decision is taken during the transformation, the transformation will continue down the erroneous path. eMoflon, like TGG Interpreter, does not require specifications to be conflict-free either.
Unlike TGG Interpreter, however, eMoflon implements local look-ahead, guaranteeing that a valid model, but not which valid model, is produced. In a comparison of TGG tools, Greenyer [29] concludes that MoTE should be used in situations where conflict-free specifications suffice, due to its efficiency. eMoflon should be used in situations which require look-ahead/backtracking, but it then comes with an overhead that affects performance. The benefit of TGG Interpreter is preservation of information during transformations.

2.4.2 QVT Relational

Unlike Triple Graph Grammars, QVT Relational (QVT-R) is not a method for implementing bidirectional transformation, but rather a language specification dictating how transformations should be specified and the intended outcome of transformations; how the actual transformations are implemented is not exactly specified. QVT-R is part of the MOF-based QVT (Query/View/Transformation) specification [30], published by the Object Management Group. QVT-R takes a similar high-level view of defining transformations as TGGs, and Stevens [31] notes that the QVT core language definition is clearly influenced by TGGs.

According to Stevens [31], the QVT-R standard is somewhat ambivalent towards allowing non-bijective transformations. She argues that the standard can be misinterpreted and does not make explicit that non-bijective transformations are allowed. However, she does show that non-bijective transformations are possible to implement, according to the standard, although the expressiveness of those has its limitations.

There are several tools, with different underlying implementations, that implement the QVT-R specification. Medini QVT [32] is perhaps the most mature one. Operating on EMF-based models, Medini QVT can be run as a standalone Java program or as an Eclipse plugin (with debug support). Medini QVT does not support model synchronization, meaning that a new target is generated from scratch during every transformation. As such, transformations in Medini QVT can be denoted →R : S → T and ←R : T → S, rather than by the more general definition presented in Section 2.3, which takes both domains S and T as input parameters.

Another tool, based on EMF models, which implements the QVT-R specification, is Echo [33]. Echo implements transformations by translating models and relations into Alloy [34], which contains a model finder and satisfiability (SAT) problem solver. A SAT problem is set up for Alloy to solve, with parameters to ensure the principle of least change, meaning that a transformed model will be as close to its original as possible. Therefore, Echo supports model synchronization through its least-change approach. Notably, Echo is able to produce several solutions, each further away from the least possible change. Furthermore, Echo also takes OCL (Object Constraint Language) constraints into account when finding appropriate models, meaning that the outcome will always be valid with regard to its metamodel. In comparison, Medini QVT does not take OCL constraints into account, meaning that it may produce invalid models. However, using a SAT solver naturally results in a larger performance overhead.

The Eclipse Foundation is working on its own implementation of QVT-R, included in Eclipse QVTd [35]. While a rudimentary implementation is already released, the full version is planned to be released in July 2017.
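The practical difference between regenerating a target and synchronizing one can be sketched in code. The following C++ fragment is our own illustration; the types and function names are hypothetical and not part of any QVT tool's API:

    #include <string>

    // Hypothetical toy "models": a source and a target element.
    struct Source { std::string name; };
    struct Target { std::string name; std::string note; /* target-only data */ };

    // Medini-QVT-style forward transformation: the target is rebuilt from
    // scratch, so target-only data (note) cannot be preserved.
    Target regenerate(const Source& s) {
        return Target{s.name, ""};  // note is lost on every run
    }

    // Synchronizing forward transformation in the style of Section 2.3
    // (forward R : S x T -> T): the old target is an input, so consistency
    // can be restored while target-only data survives.
    Target synchronize(const Source& s, const Target& oldTarget) {
        Target t = oldTarget;  // start from the existing target (least change)
        t.name = s.name;       // re-establish consistency with the source
        return t;              // note is kept
    }

    int main() {
        Source s{"Account"};
        Target old{"Acount", "reviewed 2017-05-01"};  // stale name, extra note
        Target t1 = regenerate(s);       // t1.note == "" (information lost)
        Target t2 = synchronize(s, old); // t2.note == "reviewed 2017-05-01"
        return t1.name == t2.name ? 0 : 1;
    }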
Another tool which does not implement the QVT-R standard, but has a syntax very similar to that of QVT-R and works in a similar manner, is the Janus Transformation Language (JTL) [36]. JTL works on EMF-based models and addresses change propagation by means of Answer Set Programming; when a change is detected in the source, a constraint (SAT) solver is used to find the ideal solution for a consistent target. If there is more than one solution, JTL is able to return all of them. Principally, JTL is comparable to Echo, although their underlying implementations differ.

2.4.3 Text-based approaches

The approaches presented so far have all been based on models, in the sense that they operate on models defined by a metamodel. There is also a set of bidirectional transformation tools which operate on text-based artifacts, e.g. XML (Extensible Markup Language) documents. As noted by Stevens [21], it might seem that model-based transformations should be capable of handling XML documents, since models are defined in XMI (XML Metadata Interchange) files and both formats share many properties. However, there are differences (e.g. handling of identifiers) which make these approaches incompatible.

biXid [37] is an XML-to-XML bidirectional transformation language, based on a programming-by-relation approach of specifying transformations (similar to that of QVT-R). biXid supports non-bijective relations, but will choose a solution non-deterministically in cases where ambiguity exists.

XSugar [38] is another text-based bidirectional transformation language, able to transform between XML and a self-defined text format. XSugar requires a context-free grammar to be specified for the text format to transform to. Given such a grammar, XSugar is able to map elements of an XML specification to those of the grammar, thereby bidirectionally transforming text between the two formats. The limitation of this approach is a lack of support for non-bijective relations.

Another text-based bidirectional transformation language is Boomerang [39]. In comparison to XSugar, Boomerang is generally applicable to self-defined text formats and includes a rich set of features for writing transformations. Transformations in Boomerang are written as lenses. A lens is a bidirectional program operating between two domains S and T. A lens requires that T is an abstraction of S, meaning that all information in T must have corresponding representations in S (but not the other way around). A lens mapping between the two domains consists of the following operations [39]:

• Get : S → T
• Put : T → S → S
• Create : T → S

which have to obey the following laws:

1. Put (Get s) s = s, meaning that any information lost when transforming an artifact s to an abstract artifact t must be restored when transforming back,
2. Get (Put t s) = t, and
3. Get (Create t) = t,

where the latter two both mean that there should be no loss of information when transforming from the T domain to the S domain and back again. As such, all transformations obeying the laws of lenses are well-behaved, although this comes at the cost that the T domain must be an abstraction of S. In Boomerang, a lens can consist of a composition of other lenses. Boomerang provides a basic set of lenses for transforming strings. Based on those, user-defined lenses can be written in an approach that can be described as a mix of specifying grammar and relations.
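To make the lens laws concrete, here is a minimal sketch in C++ (our own illustration with hypothetical record types; real Boomerang lenses operate on strings): T abstracts S by dropping one field.

    #include <cassert>
    #include <string>

    // Hypothetical concrete domain S and abstract domain T (T drops the age).
    struct S { std::string name; int age; };
    struct T { std::string name; };

    T get(const S& s)             { return T{s.name}; }          // Get : S -> T
    S put(const T& t, const S& s) { return S{t.name, s.age}; }   // Put : T -> S -> S
    S create(const T& t)          { return S{t.name, 0}; }       // Create : T -> S

    int main() {
        S s{"amount", 42};
        T t{"total"};
        // Law 1 (GetPut): putting back an unchanged view restores s exactly.
        S s1 = put(get(s), s);
        assert(s1.name == s.name && s1.age == s.age);
        // Law 2 (PutGet): information written through Put survives a Get.
        assert(get(put(t, s)).name == t.name);
        // Law 3 (CreateGet): likewise for Create, which needs no old source.
        assert(get(create(t)).name == t.name);
        return 0;
    }

Note how Put uses the old source s to restore the information (the age) that the abstract domain discards; this is exactly what makes the backward direction well-behaved.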
Furthermore, Boomerang supports non-bijective relations in the sense that when there is ambiguity (which only happens when transforming from T to S), the programmer can control how consistency should be restored.

3 Related Work

This chapter introduces some research of direct relevance to our contribution, building upon the foundations laid out in the previous Theory chapter. Two important research areas are summarized: how the technological spaces of grammars and metamodels can be bridged, and how bidirectional transformations between general-purpose languages can be performed.

3.1 Bridging Grammars and Metamodels

Grammars and metamodels are two different methods of expressing structure for information (code and models, respectively), and they belong to, and are treated in, two different technological spaces [40]: grammarware and modelware. Grammars tend to be used for most general-purpose programming languages. In particular, context-free grammars (e.g. BNF), along with plain English for semantics inexpressible in the grammar, are often used to specify programming language definitions. Correspondingly, metamodels are used to describe the semantics of models. While a programming language may theoretically be specified in models, as is the case with some Domain-Specific Languages (DSLs), it is not commonly the case for general-purpose programming languages. To fully utilize the strengths of both grammarware and modelware, it would be beneficial for information specified in one technological space to be available in the other.

Recalling Seidewitz's [18] definition of a model, it may seem trivial that a model can represent code, i.e. the SUS is actual programming code and the interpretation is that if there exists a language construct in the model, it will also exist correctly (in various different aspects) in the code. In a similar manner, a metamodel can be thought of as representing a grammar. While this might seem plausible in theory, the realization of such a bridge between the two technological spaces presents a handful of issues.

Alanen and Porres [41] studied the relationship between context-free grammars and MOF metamodels. In their work, they defined relations between BNF and MOF-defined metamodels in both directions. The basic relationship is one where a rule in BNF is converted to a class in the metamodel. They found that an arbitrary BNF grammar can be converted to a metamodel using an algorithm operating on the M2 layer; the resulting metamodel may not be very intelligible, depending on the grammar. In the opposite direction, however, they found that the set of metamodels that can be converted to BNF grammars is restricted. Metamodels intrinsically contain more information than BNF grammars, which presents the problem of maintaining that information when transforming back to a metamodel from a BNF grammar.

Inspired by the work of Alanen and Porres, Wimmer and Kramler [42] propose a theoretical framework for bridging the two technological spaces. Their work extends that of Alanen and Porres by defining relations in the M1 layer, meaning that programs and models can be transformed. Furthermore, Wimmer and Kramler try to enhance generated metamodels by a series of transformation rules which eliminate redundancy. They also introduce the concept of a change model, which can contain annotations that dictate properties of a metamodel derived from the grammar.
These annotations include specifying which attributes should be treated as references instead of compositions, what identifiers should be used for referencing, and data type specifications for attributes (e.g. string, int, boolean). By applying the change model, the generated metamodel will be richer in detail. Kunert [43] presents a framework similar to that of Wimmer and Kramler, although instead of having a change model, Kunert extends EBNF to support specification of annotations directly in the grammar. For example, terminals which should not be included in the abstract syntax tree, and therefore not in a metamodel generated from the grammar, can be marked for exclusion. Although the problem of maintaining information when converting from a metamodel to a BNF-based grammar and back still remains when using annotations, the set of supported metamodels becomes less restricted, since the grammar shares more information with MOF metamodels.

3.1.1 Xtext

While the frameworks by Wimmer and Kramler [42] as well as Kunert [43] are largely experimental, there is a more mature framework, included in the Eclipse project and called Xtext, which is able to bridge the grammarware and modelware technological spaces. Xtext is a framework intended for creating textual DSLs [44]. It shares similar properties with Kunert's framework, namely the ability to annotate grammars in order to describe the generated metamodel. One difference to Kunert's framework is that Xtext takes the approach of specifying what should be included, rather than what should be excluded. That is, Xtext's grammar syntax, which has a similar style to that of EBNF, is quite rich in details about what should be included and how entities should be represented (e.g. reference/containment, collection or singular objects, data types, names, etc.) in the generated metamodel. Figure 3.1 shows an example of Xtext's grammar syntax. Xtext is also able to generate a grammar based on an already existing metamodel, essentially taking the metamodel and treating it as a metamodel for the abstract syntax tree. Furthermore, Xtext includes support for model-to-text transformations in both directions, i.e. serialization and deserialization of a model, and is able to generate a textual editor for a language described by an Xtext grammar.

Table 3.1: A summary of the relationships between notations in EBNF and Xtext grammars, as found by Yue [47].

  Usage                 | EBNF                    | Xtext
  ----------------------+-------------------------+------------------------------------------
  Definition            | =                       | :
  Concatenation         | ,                       | whitespace
  Termination           | ;                       | ;
  Alternation           | |                       | |
  Optional              | [...]                   | (...)?
  Zero or more          | {...}*                  | (...)*
  One or more           | {...}+                  | (...)+
  Grouping              | (...)                   | (...)
  Exception             | -                       | !
  Terminal string       | "..."                   | '...' or "..."
  Enumeration           | name = "opt1" | "opt2"  | enum name: value1="opt1" | value2="opt2"
  Simple assignment     | A = B C                 | A: bAttr=B cAttr=C
  Collection assignment | A = B+                  | A: bs+=B+

Figure 3.1: An illustration of basic Xtext syntax. The figure shows a rule of type AClassType, with one attribute key and a collection alternativeKeys.

Although Xtext was initially intended for specifying DSLs [44], there are no theoretical limitations for using it to specify existing general-purpose programming languages. The limitations are rather those of the underlying parser generator. Xtext uses ANTLR 3 [45] in order to generate a parser which parses textual input (i.e. code). ANTLR 3 generates an LL(*) parser [46], as explained in Section 2.1.
One drawback of using LL(*) is the non-tolerance of left-recursive rules, a problem which needs to be addressed since existing general-purpose programming languages are quite often defined in BNF-based grammars, which allow left recursion. Another problem is addressing the lack of meta-information available in existing BNF-related grammars. Xtext, for example, requires names for all attributes, as well as specification of cardinality (i.e. singular or list), as exemplified by Figure 3.1.

Yue outlines the relationship between notations in EBNF and Xtext in [47]. Table 3.1 summarizes some of these relationships. Additionally, Yue [47] presents a set of strategies for dealing with patterns allowed in EBNF, such as left recursion and common factors (e.g. A = B | B C, where B is the common factor), but not in Xtext, due to the underlying parser generator.

Yue [47] does not suggest an automatic process for converting an EBNF grammar to Xtext format; nor does she deal with the lack of information (e.g. references, data types) which is required to generate a fully fledged Xtext grammar from the EBNF grammar. While there does not exist a fully automated solution producing a complete Xtext grammar today, there exist semi-automated ones. Bergmayr and Wimmer [48] present such an approach, where they define an Xtext grammar for EBNF itself, allowing an arbitrary EBNF grammar to be converted into a model. They then apply model-based transformation techniques in order to translate an EBNF model into an Xtext model (grammar), based on relationships similar to those presented in Table 3.1. It is unclear if they automatically refactor left-recursive rules, but their article mentions dealing with difficulties related to LL(*)-based parsing, of which the existence of left-recursive rules is a primary problem. Although they are able to transform many types of entities automatically, cross references still have to be handled manually. They deal with this in a similar manner as Kunert [43], by extending their EBNF grammar to allow annotating attributes as references. However, they deal with refinement of types for cross references in the Xtext grammar automatically, by example, meaning that given sample code containing cross references, a mechanism is able to infer more specific types for the cross references in the grammar.

All methods of bridging grammars and metamodels mentioned above will most likely require additional refinement to produce a useful metamodel. The reason for this is that most grammars are not intended to be used as metamodels, and therefore they contain less information than required to produce a useful metamodel. Therefore, going from the grammarware technological space to the modelware technological space may entail a certain overhead in terms of refining a generated metamodel.

3.2 Translation of General-Purpose Languages

The idea of translating between general-purpose computer languages arose during the 1950s-1960s, when code was highly dependent on the computer type it operated on, and therefore not very portable [49]. Early comprehensive attempts to solve this problem involved the construction of an intermediate universal language, UNCOL [50]; the idea being that every programming language had a translator which translated to UNCOL, and every computer had a translator which translated UNCOL into machine code. However, this idea was never fully realized [49], although the later-created LLVM is built on a similar idea [51].
In the remainder of this section, subsequent approaches to primarily bidirectional language transformation are presented.

3.2.1 Early Attempts at Bidirectional Translation

Early attempts at bidirectional source-to-source translation include the work of Albrecht, Krieg-Brückner, et al. [52], in which they translate between Ada and Pascal. Their translation consists of a chain of sub-translations, in the following manner (illustrated in Figure 3.2): Ada is translated to a subset of the Ada language, termed AdaP. AdaP is translated to an intermediate representation. The intermediate representation is translated to a subset of the Pascal language, termed PascalA (since PascalA is a sublanguage of Pascal, no transformation from PascalA to Pascal is required). The opposite order applies for translation in the other direction.

Figure 3.2: An illustration of Albrecht et al.'s [52] approach to bidirectionally translate between Ada and Pascal code. The arrows each represent a transformation. Note that AdaP ⊂ Ada and PascalA ⊂ Pascal.

The sub-languages AdaP and PascalA are defined by Albrecht et al. [52] to only include language constructs which have bijective relations to the corresponding language constructs of the other sub-language, i.e. such that there exists exactly one one-to-one mapping between a language construct and its respective counterpart. For example, assignment of data to a variable is possible in both languages and is done in a similar manner which, if translated, could have a bijective relation. Therefore, variable assignment is included in the sub-languages AdaP and PascalA. However, array assignment is only possible in Ada, and not in Pascal. Therefore, array assignment is not included in the sublanguages. An array assignment can, however, be expressed as multiple variable assignments, each accessing one element of the array. As such, array assignments in Ada can be translated by first decomposing an array assignment in Ada into multiple variable assignments in AdaP (see the sketch at the end of this subsection). AdaP is then translated in the rest of the translation chain. The purpose of the transformations from Ada and Pascal into their respective sublanguages is therefore to convert language constructs which have no direct one-to-one mapping to the other language into those which have.

In order to implement the bidirectional transformation between the two sublanguages, Albrecht et al. [52] use an intermediate representation to simplify the transformation. This intermediate representation is the unifying abstract syntax tree (AST) of both sublanguages, and contains only the necessary information to express programs in both sublanguages, i.e. unnecessary syntactic and logical information is removed. Three different transformations are used for each language (six in total), as depicted in Figure 3.2. It is worth noting that there are no transformations from a sublanguage to its full version, meaning that once a program has been translated, advanced language constructs will have been replaced. It should also be noted that Ada and Pascal are not fully compatible, and therefore Albrecht et al. [52] do not support translation of the full language specifications.
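Transposed into C++ terms (our analogy; Albrecht et al. worked on Ada and Pascal, not C++), the decomposition of an aggregate assignment into element-wise assignments looks like this:

    #include <array>
    #include <cstddef>

    int main() {
        std::array<int, 4> a{1, 2, 3, 4};
        std::array<int, 4> b{};

        // "Advanced" construct: whole-array assignment in one statement.
        b = a;

        // Decomposed equivalent using only element-wise variable assignment,
        // the kind of lowering Albrecht et al. apply when the target
        // sublanguage lacks aggregate assignment.
        for (std::size_t i = 0; i < a.size(); ++i) {
            b[i] = a[i];
        }
        return b[0] == a[0] ? 0 : 1;
    }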
3.2.2 The Idea of Concepts Common to Code between Languages

Krieg-Brückner, one of the authors of [52], later generalized the ideas of the Ada-to-Pascal translation method in [53]. Krieg-Brückner proposes that concepts more general than specific language constructs should be identified when translating code: first translating to a subset of the source language (compatible with a greatest-common-divisor intermediary), and then into the target language. As an example, an advanced Fortran IF-statement is translated into a basic variant using IF and GOTO, still in Fortran. The simple IF-and-GOTO construction is then translated into equivalent code in Pascal. Thus, the advanced IF-statement, which was supported only by Fortran, could be emulated by simpler code that had a direct counterpart in Pascal. Krieg-Brückner also proposes that certain concepts might be emulated in the target language (for example, exceptions in Ada have no direct counterpart in Pascal). A discussion follows about which abstraction level to choose: should all higher-level constructs be mapped to IF-and-GOTO constructions, and all types be mapped to a chunk of raw memory (e.g. a byte array)? Or should the higher-level concepts be kept as far as possible in the translation architecture? We will treat these questions in our results section, with some support by Wing. Wing [54] introduced the idea of computational thinking, describing how humans can be trained to think like a computer, abstracting problems with models and then algorithmically solving them.

3.2.3 Formalizing Bidirectional Translation

Building on the work of Albrecht et al. [52], Yellin [55] developed a method for translating general-purpose programming languages, based on invertible attribute grammars (AGs) and an intermediate language. AGs, a concept proposed by Knuth [56], are a formal way of expressing the semantics of a context-free language, by associating attributes with production rules of a context-free grammar. The value of these attributes can be set by assignment of other attributes or semantic functions, and when evaluated, an attribute grammar can, for example, be used in the translation of code.

Yellin and Mueckstein [57] proposed a method for automatically inverting attribute grammars that specify translations between two languages. Using this method, a translation (transformation) can be written from one language to another, and the translation in the other direction will be automatically generated, with validity ensured by construction. As such, this method shares many similarities with simpler bidirectional transformation languages (and perhaps it could even be considered as one). For an attribute grammar to be invertible, it has to be specified in what Yellin and Mueckstein [57] refer to as restricted inverse form, which limits its expressiveness (see [57] for more details). However, this does not seem to have a high impact when translating most general-purpose programming constructs.

With invertible attribute grammars as the basis for translation between languages, Yellin [55] proposed a general architecture for source-to-source translation between two or more programming languages, as illustrated in Figure 3.3.
3.2.3 Formalizing Bidirectional Translation

Building on the work of Albrecht et al. [52], Yellin [55] developed a method for translating general-purpose programming languages, based on invertible attribute grammars (AGs) and an intermediate language. An AG, a concept proposed by Knuth [56], is a formal way of expressing the semantics of a context-free language by associating attributes with the production rules of a context-free grammar. The values of these attributes can be set by assignment of other attributes or by semantic functions, and when evaluated, an attribute grammar can, for example, be used in the translation of code.

Yellin and Mueckstein [57] proposed a method for automatically inverting attribute grammars that specify translations between two languages. Using this method, a translation (transformation) can be written from one language to another, and the translation in the other direction will be automatically generated, with validity ensured by construct. As such, this method shares many similarities with simpler bidirectional transformation languages (and perhaps it could even be considered one). For an attribute grammar to be invertible, it has to be specified in what Yellin and Mueckstein [57] refer to as restricted inverse form, which limits its expressiveness (see [57] for more details). However, this does not seem to have a large impact when translating most general-purpose programming constructs.

With invertible attribute grammars as the basis for translation between languages, Yellin [55] proposed a general architecture for source-to-source translation between two or more programming languages, as illustrated in Figure 3.3. Every language construct in language A is translated to one language construct in the intermediate language I. Likewise, every language construct in language B is translated to one language construct in I. Every language construct in I may be translated into one of several language constructs in A and B respectively.

One benefit of having this intermediate language is that adding a new language to the translation process requires only one additional translation to be written, namely one from the new language to the intermediate language (e.g. TC in Figure 3.3), since the inverse is automatically generated (e.g. TC⁻¹). If translating between n languages, only n transformations would need to be written in total, compared to n(n-1)/2 (one for every pair of languages) had an intermediate language not been used.

Figure 3.3: An illustration of Yellin's [55] architecture for bidirectionally translating between multiple languages. This figure is based on the work of Yellin [55].

The intermediate language I is defined by Yellin as the canonical form, meaning: "1. Every program (in each source language) must be expressible as a program in the canonical form, and 2. every canonical form program that is the image of one translation must be expressible as the image of every other translation." [55]. Yellin then goes on to say that a preferable strategy for picking a canonical form is to select the greatest common divisor of all languages involved in the translation, meaning that every language construct that has corresponding construct(s) in all other source languages should be included. This way, as much of the program structure as possible is kept intact during a translation. Otherwise, since virtually all general-purpose languages are Turing complete, many language constructs could be written in the form of another (e.g. a loop could be written with GOTO statements, and thus only GOTO would be needed in the intermediate language), but this could possibly result in unreadable code after translation.

One benefit of using an intermediate language selected by greatest common divisor is that the transformation becomes more concept-centric, as described in Section 3.2.2. Compared to the approach taken by Albrecht et al. [52], consisting of two layers of translations, every language construct which does not exist in both languages can be mapped directly to a viable representation in the intermediate language. For example, in C, a numeric variable can be incremented in different ways, e.g. (i) i++, (ii) i += 1, (iii) i = i + 1. However, when translating to a language which only permits expression assignments (iii), the greatest common divisor is (iii). This scenario gives rise to a many-to-one mapping, and a one-to-one mapping, as illustrated in Figure 3.4. Albrecht et al. [52] would deal with this situation by first transforming alternatives (i) and (ii) to (iii) in a sublanguage of C. Their sublanguage representation is then translated to the other language. As such, Yellin's [55] approach requires fewer transformations. It does, however, require that any many-to-one mapping is managed when translating in the other direction. For example, which alternative (i), (ii), or (iii) should be chosen when transforming from the other language back to C? Yellin solves this by statically selecting one alternative as the only alternative and marking the other production rules in the attribute grammar as non-invertible (illustrated by the direction of the transformations of alternatives (i) and (ii) in Figure 3.4).

Figure 3.4: An example illustrating a translation scenario with multiple ways to express an increment statement in one language (left), and only one in another language (right). The GCD of these languages is i = i + 1 (middle).
Yellin [55] addresses an issue with defining the intermediate language based on the greatest common divisor, namely that if another source language with a lower level of abstraction is added, the entire intermediate language might need to be redefined. As such, he discusses another form, the least common multiple, which entails that all high-level constructs found in any of the source languages are included in the intermediate language. The burden, instead, is then to write translations which can describe every high-level construct not found in a language in terms of the low-level constructs that do exist. Furthermore, such an intermediate language may become incomprehensible and hard to grasp.

4 Method

As a general discipline, design research was used as a research methodology due to the specific circumstances of our work; specifically, solving the real-world problems identified in Section 1.1 based on theory and real-world context, and refining existing design disciplines. In order to answer all research questions, results were reached by iteratively solving smaller problems and combining them. The result chapters outline the main problems explored:

• how COBOL is used in practice,
• which popular language is suitable for translation together with COBOL,
• how such a translation can be performed bidirectionally,
• and what strategies are needed for parts not directly translatable.

Systematically, we followed the guidelines on design research defined by Hevner et al. [58] (refer to Table 1 in their paper):

Guideline 1: Design as an Artifact. Our work produced a viable artifact in the form of a working source-to-source compiler prototype.

Guideline 2: Problem Relevance. Our work is of relevance to the industry, as motivated in the Introduction chapter.

Guideline 3: Design Evaluation. The artifact was evaluated, as described later in this chapter, on its utility, quality, and efficacy.

Guideline 4: Research Contributions. We present a research contribution in the area of bidirectional transformations on general-purpose programming languages, using higher-level concepts (as presented as results in Chapter 8).

Guideline 5: Research Rigor. The results were reached and evaluated by scientifically applying the methods described later in this chapter.

Guideline 6: Design as a Search Process. We extensively used and combined available artifacts, in the form of already developed software tools. Where no suitable software artifact was available to solve a problem, we developed one.

Guideline 7: Communication of Research. This thesis serves to introduce and develop the field. While it is mostly technically oriented, there are some parts of relevance to management. Specifically, the importance and implications to the industry have already been described in the Introduction.

4.1 Survey of Cobol Language Construct Frequency

In order to survey how COBOL is used in practice, a tool was developed that can count occurrences, in actual program code, of the different language constructs available to COBOL programmers. The formal grammar that the open-source compiler GnuCOBOL [6] (formerly OpenCOBOL) uses when parsing programs for compilation was used as the baseline for what language constructs the tool can recognize. For analysis, the full source code of the open-source program Applewood Computing Accounting System [59] was used. The analysis ranked usage of language constructs, by occurrence, in the analyzed code base. The results are presented in chapter 5.
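As an indication of the kind of instrumentation involved, the following is a hypothetical C++ sketch of a counting hook of the sort one might invoke from parser actions; the actual tool modifies GnuCOBOL's C-based parser, and all names below are our own, not GnuCOBOL's:

    #include <cstdio>
    #include <map>
    #include <string>

    // Tally of how many times each recognized grammar rule was reduced.
    static std::map<std::string, unsigned long> construct_counts;

    // Called from a parser action, e.g. in the rule recognizing an IF statement.
    static void count_construct(const char* rule_name) {
        ++construct_counts[rule_name];
    }

    // Invoked once compilation finishes, producing data for later ranking.
    static void dump_construct_counts() {
        for (const auto& entry : construct_counts)
            std::fprintf(stderr, "%lu\t%s\n", entry.second, entry.first.c_str());
    }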
4.2 Choosing a Target General-Purpose Language

The task of choosing a language to translate COBOL to and from is many-faceted. First, a requirement has been that the language is popular and general-purpose in nature. Next, different languages exhibit different properties, making them compatible with COBOL to different degrees. This section describes how we chose the best candidate for our requirements. The results are presented in chapter 6.

First, a list of popular general-purpose languages was compiled, after which a comparison of these languages and COBOL was performed. In the comparison, we chose different criteria to classify the languages. Languages were eliminated from the comparison early on when it was clear they were not suitable for use together with COBOL. These criteria were chosen, according to our best judgment of relevance, to be the following:

• Static or dynamic typing
• Memory management (allocation on stack, heap, and garbage collection)
• Primitive types
• Classes and objects
• Functions
• Basic syntax (arithmetics, conditionals, branching, loops)
• Pre-processing and meta-programming (macros, templates, overloads)

4.3 Emulating COBOL Data Types in C++

As is perhaps already clear from the heading, C++ was ultimately chosen as an optimal target language. It was also established (in the results, see chapter 7) that some data types cannot be translated directly from COBOL to C++, as there is no direct counterpart. These types can, however, be emulated in code by user-defined data types. A proof-of-concept standard library, providing emulation for all COBOL data types not already supported by C++, was developed with the following goals:

• Bit-perfect storage of data (mirroring the COBOL format).
• Integer formats:
  – Plus operator overloading (showing arithmetic support).
  – Casting operator overloading between the emulated types and native types.
  – Arithmetic overflow callbacks.
• Strings:
  – Proper formatting of string data types, as COBOL formats them (arbitrary COBOL PICTUREs).
• Compound data definition:
  – COBOL RECORD emulation.
  – COBOL REDEFINES RECORD emulation.

Each method/overload on the developed types was tested when implemented, showing that the developed features work. The corresponding result chapter is chapter 7.
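To make the integer-format goals above concrete, consider the following minimal sketch (our own illustration, not the actual library, which is presented in chapter 7; __builtin_add_overflow is a GCC/Clang-specific builtin used here for brevity): an emulated integer with a plus operator, a cast to a native type, and an overflow callback.

    #include <cstdint>
    #include <functional>

    struct EmulatedInt {
        std::int32_t value;                        // stand-in for the COBOL storage format
        static std::function<void()> on_overflow;  // arithmetic overflow callback

        // Plus operator overload; invokes the callback on overflow and
        // returns the wrapped result.
        EmulatedInt operator+(const EmulatedInt& rhs) const {
            std::int32_t result;
            if (__builtin_add_overflow(value, rhs.value, &result) && on_overflow)
                on_overflow();
            return EmulatedInt{result};
        }

        // Casting operator overload to a native type.
        operator std::int32_t() const { return value; }
    };

    std::function<void()> EmulatedInt::on_overflow;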
4.4 Creating a Model-Driven Source-to-Source Compiler

The general method used for translation in the developed source-to-source compiler can be described in four steps: a text-to-model transformation from the source language, a model-to-model transformation to an intermediate model, a model-to-model transformation from the intermediate model, and a model-to-text transformation to the target language. This process is illustrated as an activity diagram in Figure 4.1.

Figure 4.1: An activity diagram showing the transformation flow, including each step involved in the translation process.

In order to implement the model-to-model transformations, the bidirectional transformation language QVT-R was used. The choice of using a bidirectional transformation language, compared to specifying pairs of unidirectional transformations, is motivated by gaining transformation validity by construct. This is a means to partly answer RQ3a and RQ3b. The motivation for choosing QVT-R over the other options (presented in Section 2.4) is threefold:

1. QVT-R is an established language standard, meaning that code written for it can be run in several tools (interpreters), including future tools which may become more mature over time. In contrast, tools implementing TGGs may have slightly different syntax and setup, and the text-based transformation languages all have different syntaxes.

2. There are tools available that implement model synchronization and handle non-bijective relations. These tools are almost exclusively built on top of Ecore, meaning that any models used will be compatible with other Ecore-based tools (for example, analysis and diagram tools to improve quality and understanding). In short, many opportunities in the MDE field become available.

3. QVT-R provides the option to specify OCL expressions for handling complex relations and conditions. To our understanding, text-based languages (such as Boomerang) and TGG-based tools are not as flexible in this respect.

Furthermore, a reason for not selecting a text-based transformation language is uncertainty about whether C++ and COBOL grammars can be fully specified in such a language. The remainder of this section presents the steps taken when implementing the prototype source-to-source compiler. Work was guided by small iterations, where each of these steps was to some extent included. The outcome of each iteration determined the decisions and scope of the next.

4.4.1 Creation of an Xtext COBOL Grammar

Since QVT-R is a model-based transformation language, COBOL and C++ code must be transformed to models in order to apply transformations. To bridge the grammarware and modelware technological spaces, Xtext was used. That is, an Xtext grammar for COBOL was defined, and Xtext was used to generate the corresponding metamodel, enabling text-to-model transformations for COBOL code.

The focus of our work is primarily on translation. As such, we chose to reuse and adapt an already existing grammar for COBOL. The grammar chosen [60] was created by Lämmel and Verhoef in [61] and reflects IBM's VS COBOL II Reference Summary. The grammar is specified in EBNF format and has been tested on over 2 million lines of COBOL code [61]. Lämmel and Verhoef's [61] COBOL grammar was converted to an Xtext grammar using Yue's [47] approaches and rules for conversion between EBNF and Xtext (see Table 3.1). In addition to these rules, one more rule was required: the symbol '||' in Lämmel and Verhoef's grammar denotes any permutation of two rules, which is represented as an unordered group, i.e. the '&' symbol, in an Xtext grammar (for example, a rule allowing 'a b' as well as 'b a' becomes a & b in Xtext).

4.4.2 Creation of an Intermediate Model

Like Yellin [55], we used an intermediate representation (model) to implement transformations between the two source languages. This intermediate model was defined to be the canonical form of the two languages. Unlike Yellin, however, we present our own approach, based on common concepts between programming languages, for defining the canonical form, in order to answer RQ3d. This approach, together with the resulting model, is presented as a result in Section 8.3.

4.4.3 Specifying Transformations in QVT-R using Echo

The QVT-R transformations between each of the source models were originally intended to be run by Echo. The motivation for this was threefold:

1. Echo provides support for model synchronization through its least-change principle. In practice, this would mean that any changes made to the target model would propagate back and result in a minimally changed source model. As such, RQ3c would, in theory, be answered, since the least-change principle would guarantee only local changes.
2. Echo's least-change principle would guarantee that any non-transformable information excluded from the transformation remains intact after a back-and-forth transformation.

3. Echo provides support for generating all possible solutions when synchronizing models, meaning that a user or an automated mechanism could pick the most appropriate solution.

Using Echo v0.3.1, Eclipse Modeling Tools v4.6.2, Eclipse QVTd v0.11, and Eclipse OCL v4.2, we wrote the initial transformations in QVT-R. However, we encountered numerous problems with Echo. First, we discovered that Echo does not support certain Ecore data types present in the metamodels generated by Xtext. Second, when running transformations, several scenarios produced Java exceptions, which required us to debug Echo's source code, and in some cases modify it. Finally, we could not make Echo obey containment rules in metamodels, i.e. Echo treated a containment (composition) as a regular reference. As a consequence, the solutions produced by Echo were invalid. The solutions produced could also become massive; that is, Echo could produce solutions containing classes that were not connected to anything else in the model. As such, when the model grew in size, the SAT problems solved by Echo grew even more in size (since they included all invalid solutions) and took too long to solve. As an example, synchronizing two simple models with 5 classes each, where each class had a small set of attributes and relations to other classes in the same model, could produce a SAT problem with over 2 million variables, taking over 30 minutes to solve. Therefore, we made the decision to abandon Echo.

4.4.4 Specifying Transformations in Medini QVT

Due to the problems faced with Echo, we used Medini QVT to run the QVT-R transformations instead. The transformations already written could be reused directly. Due to the limited developer literature for Medini, and for QVT-R in general, we faced several scenarios which were non-trivial to solve. We devised approaches for these scenarios, and together with the resulting transformation specifications, they are described as results in Section 8.4.

A drawback of using Medini QVT is that it does not support model synchronization, meaning that RQ3c will not be fulfilled when running the transformations in Medini.

4.4.5 Creation of an Xtext C++ Grammar

Similarly to the approach for defining an Xtext COBOL grammar, an Xtext grammar for C++ was defined. Lämmel and Verhoef's [61] COBOL grammar does not support object-oriented COBOL. As such, we limited our source-to-source compiler to only support the sequential part of C++ (i.e. the C-like part), with the exception of a few classes supplied by us to emulate COBOL data types. We found a C grammar defined by Terence Parr [62], implemented in ANTLR3, which was converted to an Xtext grammar. Unlike the COBOL grammar, this C grammar has not been rigorously tested. However, it sufficed to represent the sequential part of C++ which we required. Because this C grammar was implemented in ANTLR3, it was trivial to convert to Xtext (since Xtext is built on ANTLR3). Therefore, using this C grammar was deemed preferable to converting a C++ EBNF grammar to Xtext.

4.4.6 Evaluating Results

Due to the limited size of our prototype, we were able to perform manual tests on different combinations of language constructs to evaluate our approach.
Representative examples, based on code from the Applewood Computing Accounting System, were devised and used to evaluate mainly three different aspects:

1. Correctness - whether the translation is valid, in that a piece of translated code produces the same output as the original code. Deeply nested constructs are of particular interest, due to their non-trivial nature.

2. Intent preservation - how the intent of a certain piece of code is preserved after a translation.

3. Construct preservation - how well unchanged constructs are left intact after a back-and-forth translation, both including and excluding changes to the code base.

Note that unmodified extracts of code from Applewood Computing Accounting System [59] could not be used, due to the limited range of language constructs supported by our prototype. Instead, examples containing as much supported code as possible were gathered, and the unsupported language constructs were removed.

In evaluating and discussing the validity of transformations, we differentiated between two types of validity, transformation validity and specification validity, which are defined as follows:

Definition 4.4.1. Transformation validity is regarded as guaranteeing that transformations produce a valid result, in both directions, according to specification.

Definition 4.4.2. Specification validity is regarded as guaranteeing that relations are specified in a manner such that constructs in one language correctly relate to those in the other language, i.e. the meaning is correct.

Since transformation validity is largely guaranteed by construct when using a bidirectional transformation language, the main focus of the evaluation was specification validity.

5 Survey of Cobol Language Construct Frequency

This chapter presents the tool developed in order to analyze the occurrence frequency of COBOL language constructs, and the results obtained by running it against a sample project. Performing the analysis is of interest because the results can be used to scope the work in the remaining result chapters. The validity of generalizing the results is discussed at the end of this chapter.

5.1 The Developed Analysis Tool

In order to measure occurrences of certain COBOL language constructs in a given code base, the compiler GnuCOBOL [6] (formerly OpenCOBOL) was modified by adding counting code to the parser generator specification (a formal grammar and AST-construction code are already specified together for use by the parser generators yacc or Bison in the existing source). A total of 429 different grammar rules, which we deemed relevant to distinguish between, are distinctly recognized. This approach allows every program that can be compiled with GnuCOBOL to be analyzed by simply compiling it, using the pre-existing build scripts of the examined project.

5.2 Analysis Results

The most commonly used language constructs in Applewood Computing Accounting System [59] are listed in Table 5.1. In summary, basic constructs such as IF, ADD, MOVE etc. are used in this COBOL source as well. The most used constructs were record declarations and PICTURE clauses, similar to struct and variable declarations. VALUE IS and USAGE IS are related to these variable definitions. We also saw that the DISPLAY and ACCEPT commands, responsible for console output and input, were important.

Table 5.1: Summary of the 50 most commonly used language constructs in the COBOL program Applewood Computing Accounting System.
Language construct: Occurrences

RECORD declaration: 41660
PICTURE declaration: 30980
VALUE IS (on PICTURE): 17935
MOVE: 11919
CONDITION (88 level): 7934
IF: 5287
DISPLAY: 4022
CONSTANT ENTRY (78 level): 3301
(DISPLAY) WITH FOREGROUND-COLOR: 3288
GOTO: 3187
PARAGRAPH declaration: 2700
USAGE IS BINARY-LONG SIGNED: 2671
PERFORM _: 1699
Screen declaration: 1581
Screen opt COLUMN: 1528
REDEFINES (RECORD): 1417
USAGE IS BINARY-CHAR SIGNED: 1233
USAGE IS BINARY-SHORT SIGNED: 1215
ADD _ TO _: 1153
USAGE IS COMP: 1062
WRITE: 1050
ELSE: 1036
Screen opt LINE: 849
(WRITE) BEFORE/AFTER ADVANCING lines: 844
(RELEASE/WRITE/REWRITE) FROM: 814
Section header declaration: 801
USAGE IS COMP-3: 793
ACCEPT: 776
OCCURS: 719
EXIT SECTION: 645
(EVALUATE) WHEN: 570
(STRING) DELIMITED BY: 553
(accept opt) FOREGROUND-COLOR: 538
CALL: 510
(CALL/ENTRY) USING: 509
(accept opt) UPDATE/DEFAULT: 435
USAGE IS BINARY-CHAR UNSIGNED: 419
CLOSE: 416
(STRING/UNSTRING) WITH POINTER: 363
Screen opt USING: 351
STRING: 343
READ: 335
(DISPLAY) WITH HIGHLIGHT: 323
(DISPLAY) WITH ERASE EOL: 311
FILE-CONTROL ASSIGN: 296
FILE-CONTROL SELECT: 296
FD (file type): 285
(DISPLAY) WITH ERASE EOS: 265
SUBTRACT _ FROM _: 262
Screen opt FOREGROUND-COLOR: 236

5.3 Limitations and Validity of Generalization

A limitation of the developed tool is that it only supports analysis of COBOL programs compilable by GnuCOBOL (including pre-processing). As we only tested the tool on one program, we cannot be certain that the results obtained are generalizable. However, we have no indication that they should not generalize, except that the system under test is a console program, hinting that DISPLAY and ACCEPT might be used differently in more integrated programs.

6 Choosing a Target General-Purpose Language

This chapter explains why we chose C++ as an optimal target language. First, a list of popular languages (relating back to RQ1, which requires that the language be well-known) is obtained. Second, a comparison between these languages is given. Last, a discussion follows, concluding that only a subset of C++ (a hybrid between C and C++) will be supported by our work, due to time limitations.

6.1 The Comparison

The software quality company Tiobe compiles a list of languages each month, ranking them by popularity [3]. At the time of writing, the most recent one is the Tiobe Index for April 2017, which places the following languages, in order, as the most popular ones: Java, C, C++, C#, Python, PHP, VB.Net, and JavaScript. These languages will be used in the comparison. The headers below match the list of criteria introduced in the Method section.

6.1.1 Static or Dynamic Typing

The selected languages can be classified according to the two paradigms of loosely typed scripting languages (Python, PHP, VB.Net, and JavaScript) and typically compiled, statically type-checked languages (Java, C#, C, and C++, along with COBOL). Due to this difference, we decided to remove the scripting languages from the remainder of the comparison and focus on Java, C#, C, and C++. For sources supporting the claims made for each respective language, refer to the language specifications for Java [63], C [64], C++ [65], C# [66], and COBOL [67].

6.1.2 Memory Management and Environment

One important difference between Java and C# on one side, and C, C++, and COBOL on the other, is how memory is managed.
In COBOL, from what we saw in the language construct survey, most memory is allocated statically at the beginning of a program and remains available until program execution halts. On function calls, memory may also reside on the stack for parameters and return values. Even though we did not see it in actual code, it is possible to allocate memory on the heap [68]. Such heap memory must be returned to the system manually. Both C and C++ are similar in this regard. Java, on the other hand, only allows primitive data types to be allocated statically and stored on the stack. More complex objects need to be heap-allocated. Additionally, memory is managed through a garbage collector. C# makes the distinction in the language between heap-allocated, garbage-collected classes and stack-allocated structs.

6.1.3 Primitive Types

The primitive types in C#, C, and C++ allow for storing numbers in signed and unsigned form, compared to Java, which only allows signed numbers (except for char, which is a 2-byte unsigned type). COBOL has a complex type system with pictures (PICs) that store numbers or strings. The numbers can be signed or unsigned, and may contain a decimal, which might be implied (not present in memory). The memory representation can be binary (the same as in C/C++/C#/Java) or text-based, so that the number can be printed directly as a string. Additionally, it can be stored as a packed form of the text-based format. All languages support 32- and 64-bit floating-point numbers (typically IEEE floats).

Unlike the other languages, COBOL strings encode information about formatting, which can be arbitrarily specified by the programmer. A string, for example, might contain three letters followed by three digits, separated by an implied decimal (i.e. a comma not stored in memory at runtime, but still included when formatting the string). This information is stored in the type itself. When outputting these strings to the console using the DISPLAY command, they are formatted with, for example, the implied decimal printed and leading zeroes converted to spaces [69]. Also, as already hinted at in the COBOL language construct frequency survey, the DISPLAY command can manipulate the console appearance, something not natively supported in the other languages. For example, a majority of all DISPLAY commands in the analyzed code had the option to display WITH FOREGROUND-COLOR. These features should be possible to emulate in all the other languages using helper functions.

6.1.4 Classes and Objects

COBOL has the notion of records, where primitive data types and other sub-records are stored together, like a struct in C, C++, or C#. A class in C++, C#, and Java is also more or less similar. Of particular note are REDEFINES records, where a sub-record aliases another in memory; this can be implemented in C# with field offsets, and is similar to a union in C or C++. In Java, such functionality is not directly available to the developer. While C and C++ do not guarantee memory storage location, i.e. where in a struct a member is stored, or that unions can be used to read aliased memory, doing so is so common that it is almost an informal standard, according to our experience.
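As a hypothetical C++ sketch of the REDEFINES-style aliasing just described (our own illustration, relying on the informal layout conventions discussed above rather than on guarantees of the language standard), a union can expose two views of the same bytes:

    #include <cstdio>

    struct Date     { char year[4]; char month[2]; char day[2]; };  // structured view
    struct DateText { char text[8]; };                              // flat, REDEFINES-like view

    union DateRecord {
        Date     parts;  // write through the structured view...
        DateText raw;    // ...and read the same 8 bytes as flat text
    };

    int main() {
        DateRecord d = {{{'2','0','1','7'}, {'0','6'}, {'1','5'}}};
        std::printf("%.8s\n", d.raw.text);  // prints "20170615"
        return 0;
    }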
COBOL has also supported classes since a recent version, but we have not investigated the matter further, as the language construct survey revealed that they were not used in practice (it can also be argued that they are of less practical interest, as the vast majority of existing COBOL code was written before they were introduced into the language).

6.1.5 Functions

COBOL, C, and C++ allow something similar to a function to be called. Arguments are passed on the stack and returned there (or via processor registers, at the implementation's discretion). Similarly, Java and C# allow methods to be called on objects, even though free-standing functions are not allowed. Of note again is that whole objects cannot be passed on the stack in Java; a reference to a heap-allocated object must be used. C# and C++ allow operator overloading, where for example arithmetic operators such as '+' and '*' are mapped to user-defined functions. A COBOL function (a paragraph, in COBOL terminology) may not call itself recursively, in contrast to its C, C++, C#, and Java counterparts.

6.1.6 Basic Syntax

The syntax of basic operations and control flow is very similar in C, C#, C++, and Java. The concepts are similar in COBOL too, but COBOL tends to use separate statements for basic arithmetic (ADD, ADD-TO, MULTIPLY etc.) and a separate COMPUTE statement for more complex arithmetic expressions. In C, C++, C#, and Java, everything tends to be an expression, regardless of how complex it is. One important difference in COBOL compared to the other languages is that it can handle integer arithmetic overflow by natively controlling program flow using simple branching (like an if statement). Another important difference is that data can flow from several sources to several targets, compared to C/C++/C#/Java, which are more single-data load-store oriented. An example of this is the ADD statement, which allows several sources to be added together and then stored at several destinations, all in one statement.
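As a hypothetical sketch of how COBOL's native overflow branching could be mirrored in C++ (our own example; the function name is invented), the addition reports overflow, and ordinary control flow takes the error branch:

    #include <cstdint>
    #include <cstdio>

    // Returns true when the addition would overflow, roughly playing the
    // role of COBOL's overflow branch. The check avoids performing the
    // signed overflow itself, which is undefined behavior in C++.
    static bool add_with_size_error(std::int32_t a, std::int32_t b, std::int32_t& out) {
        if ((b > 0 && a > INT32_MAX - b) || (b < 0 && a < INT32_MIN - b))
            return true;
        out = a + b;
        return false;
    }

    int main() {
        std::int32_t total;
        if (add_with_size_error(2147483647, 1, total))
            std::puts("size error");     // the overflow branch
        else
            std::printf("%d\n", total);  // the normal branch
        return 0;
    }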
6.1.7 Pre-Processing and Meta-Programming

C, C++, and COBOL have pre-processors that allow simple macro expansion. C++ provides advanced meta-programming features through templates, allowing code run at compile time to decide how a type or function should be constructed. Java and C# have simpler template-like semantics implemented as generics, mimicking C++ templates in the sense that a type may take another type as an argument to construct the final instantiation at compile time. C++ templates, and C++'s ability to reference memory byte-by-byte and bit-by-bit, allow all COBOL data types to be emulated in a bit-perfect manner. Considering that there are theoretically an infinite number of data types in COBOL (as they are all custom), this is an important result. Furthermore, operator overloading in C++ can provide native arithmetic and casting support for these types. Paired with structs and unions, COBOL records can be emulated directly, as the memory can be forced to align perfectly. As previously mentioned, these custom types can be stored statically, stored on the stack as function parameters, and passed as function return values in C++. Finally, the COBOL equivalent of an array (a table), and its access semantics, can be similarly translated using a C++ template implementation.

6.2 Discussion

Given that COBOL data types can be constructed in C++, it seems to be the best choice given our constraints. It is also evident how Java's memory model might interfere with programming, as it is constrained to garbage-collected references in many cases. This might require developers to write code in a very specific manner, unlike what they are used to, making the popularity of the language less relevant. Writing code to fit Java-to-COBOL translation might also interfere with standard Java compile-time checks about what is allowed or not on regular static types. These reasons might defeat the purpose of having a transformation altogether.

Comparing C and C++, the latter is more or less a version of the former with new features. In order to fit the scope of the thesis, we chose to support a special variant of C++ that allows us to use the C++ features for implementing our standard library supporting COBOL data type emulation. User code, on the other hand, is restricted to the transformations that are supported (see chapter 8), more like basic C syntax. This restriction simplifies translation, as templates and overloaded operators need not be supported in the transformations. As already mentioned, classes are not supported either, due to the scoping of our work.

7 Emulating COBOL Data Types in C++

This chapter describes the standard library we developed for emulating COBOL data types in C++. The results will be used in chapter 8 when answering research question RQ3, for when a COBOL data type that has no native C++ counterpart is translated.

The standard library was realized with C++ templates and structs with operator overloads in these. Three different data types were realized: one for the default decimal numeric type (based on EBCDIC-encoded strings), one for the computational field type packed decimal, and one for regular strings. We concluded that other types of so-called computational fields could be translated to regular integral types and floating-point numbers. COBOL records can be represented with C++ structs, and REDEFINES records with unions, as the developed data types are stored exactly as their equivalent COBOL counterparts, bit by bit. Furthermore, COBOL tables may be represented as C++ arrays, and loop indexing by the native memory indexing type std::size_t. The COBOL type FILLER (which occupies dummy memory) is trivially represented by an empty array of bytes. Using these conversions allows every data type representable in COBOL to also be representable in C++.

7.1 Enhanced Byte Arrays

A native array of raw bytes (chars) in C++ cannot be directly passed by value on the stack [65]. This limitation was circumvented by defining a templatized struct, ByteArray, containing an array of the given size, as in Listing 1. The ByteArray could then be used to have full control over the memory when defining the emulation types. It should be noted that the compiler is allowed to lay out the memory as it wants [65], but we rely on it ordering fields in the order they are defined in a struct (in our experience, this is the common case). Furthermore, we rely on the compiler not aligning the memory addresses (e.g. on 32-bit or 64-bit boundaries), a feature that might need to be turned off in some implementations.

    typedef unsigned char byte;

    template <std::size_t L>
    struct ByteArray {
    private:
        byte data[L];

    public:
        ByteArray() : data() {}

        byte& operator[](std::size_t idx) { return data[idx]; }
        const byte operator[](std::size_t idx) const { return data[idx]; }
    };

Listing 1: An array of raw bytes that can be passed by-value on the stack.
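For illustration, a short usage sketch of Listing 1 (our own, not part of the thesis's listings): because the raw array is wrapped in a struct, a whole ByteArray can be returned by value, which a raw C array cannot.

    // Assumes the ByteArray definition from Listing 1 is in scope.
    ByteArray<4> make_id() {
        ByteArray<4> id;
        id[0] = 0xF1;  // EBCDIC '1'
        return id;     // returned by value, copying all four bytes
    }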
7.2 The Default Decimal Type

The default COBOL data type allows numbers to be stored in the EBCDIC format (similar to an ASCII string, but with another encoding) [67]. The four least significant bits of each byte contain the digit (0-9), and the four most significant bits are normally set high (1111). The data might be signed, in which case the sign mask replaces the four most significant bits of either the first or the last byte; which alternative is used can be configured in code when declaring the data type. Finally, the number may contain decimals, and the decimal sign (a dot) may be implied, meaning that it is not explicitly stored as a byte in the data. An implied decimal is the default. The byte widths of the integer and decimal parts are configured in code when declaring the data type.

The emulation type was realized as a templatized struct, shown in Listing 2, which allows all mentioned parameters to be specified as part of the compile-time type. Of particular note is how the expression defining the byte width can be resolved at compile time and baked into the instantiated type. Overloaded operators on the type, allowing addition and add-assign as examples of arithmetic operators, as well as casting to strings and doubles, are also of interest. To enhance readability, the constructors that initialize specific instances to zero, to a given default value, or to a copy of the packed decimal type discussed below (whose internals are accessed by making it a friend), are not shown.

7.3 Packed Decimal

COBOL also allows packing a default decimal number into almost half the bit-size, by removing the four most significant bits of each byte, thus storing two digits in the same byte [67]. Given that the resulting type cannot be directly printed, the decimal character is always implied, and thus not stored. In order to store the sign, an extra four bits are required, at the least significant position. For an odd number of digits, the sign bits fit into the remaining half-byte. For an even number of digits, the four most significant bits of the first byte are set to zero, and the four least significant bits of the last byte contain the sign. It also happens to be the case that four sign bits are reserved even for unsigned numbers, in which case they are set to all ones.

The packed decimal emulation type (see Listing 3) was realized similarly to the unpacked version. Its internal implementation temporarily converts to the unpacked version for calculations, and then back (optimizing for code simplicity over performance). While the cast overloading technically can handle all cases, C++ does not automatically use it, e.g. when adding a packed decimal to another packed decimal. Therefore, the operators are overloaded in this type too. When adding a packed decimal to the default unpacked version, casting is performed and the arithmetic operation is handled directly in the default type.
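To give a compressed impression of the idea behind the decimal types above, the following is our own minimal sketch of a zoned-decimal-style type (unsigned, with an implied decimal point), not the thesis's actual Listing 2 or 3:

    #include <cstddef>

    // I integer digits and D decimal digits, one EBCDIC-zoned byte per digit:
    // high nibble fixed to 1111, low nibble holding the digit value.
    template <std::size_t I, std::size_t D>
    struct ZonedDecimal {
        unsigned char digits[I + D];

        ZonedDecimal() {
            for (std::size_t i = 0; i < I + D; ++i)
                digits[i] = 0xF0;  // zone bits set high, digit 0
        }

        // Cast to double, honoring the implied decimal point, which is part
        // of the compile-time type and never stored in memory.
        operator double() const {
            double v = 0.0;
            for (std::size_t i = 0; i < I + D; ++i)
                v = v * 10.0 + (digits[i] & 0x0F);
            for (std::size_t i = 0; i < D; ++i)
                v /= 10.0;
            return v;
        }
    };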
7.4 Strings

COBOL also allows its data types to take an arbitrary form of text and numbers, possibly formatted in a certain way [67] (see chapter 6 for details). Thus, the formatting is part of the type information. As a data type that contains text cannot be used directly in numerical calculations (without converting it first, e.g. by using REDEFINES records and overlaying a decimal type padded by FILLER), such data types are represented by a separate string type. The multi-byte string type Double-Byte Character Set (DBCS) is not explicitly supported by our implementation. The implementation should be able to handle such data transparently, but there is no way of specifying literals of such a type in C++, and we have not developed any value-converter as a workaround.

The string type was realized using a template that poses as a tuple, allowing different pieces of the string, with different formatting, to be concatenated together. For example, the code PIC<X<5>, C, Z<3>, V, N<2>> WS_HELLO("Hello,00314"); defines:

• five characters of any type (X),
• followed by a comma (C),
• followed by three digits, where leading zeroes should be replaced with spaces when formatting the string (Z),
• followed by an implied decimal that should not be stored in memory, but still printed when formatting the string for output (V),
• and finally followed by two regular digits (no special formatting of leading digits) (N).

Listings 4 and 5 show how the types were implemented in more detail. The tuple is recursively defined and accessed using tail recursion, with inspiration from the method presented in [70], but using types instead of literals. Note how each type (X, C, Z, V, N etc.) has a cast-to-string operator overload and occupies as much memory as needed. Note also how the V option is only part of the compile-time type, as it has no members and thus occupies no memory at run time.

7.5 Summary

We have presented one way to represent COBOL data types in C++. Relating back to the feature goals in the Method section, trivial tests showed that all listed features work. Data is stored bit-perfectly in byte arrays. Overloads on the encapsulating integer types allow simple arithmetic operations (with overflow detection) and casting between formats. String types can be formatted properly. COBOL RECORDS and REDEFINES can be handled by native language constructs (C++ struct and union), given some assumptions on the compiler, notably outside the language specification but commonly implemented by vendors. The results obtained can thus be used when answering research question RQ3.