Creating a Bi-Directional Source-to-Source Compiler Using MDE Transformation Techniques

A Proof-of-Concept Using the General-Purpose Programming Languages COBOL and C++

Sebastian Blomberg and Joel Severin

Master's Thesis 2017
Department of Computer Science and Engineering
Division of Software Engineering
Chalmers University of Technology
University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000
Gothenburg, Sweden 2017

© Sebastian Blomberg and Joel Severin, 2017.

Supervisor: Regina Hebig, Department of Computer Science and Engineering
Examiner: Jan-Philipp Steghöfer, Department of Computer Science and Engineering

Abstract

In this thesis, we discuss how a bidirectional source-to-source compiler, for the general-purpose languages COBOL and C++, can be implemented using Model-Driven Engineering (MDE) tools and practices. The outcome of our work is primarily an approach for implementing said bidirectional compiler using formal grammars, bidirectional transformation languages, and a developed concept model. This approach also illustrates how a programmer's intent can be transferred between languages. In order to evaluate the approach, a prototype was realized using Ecore, Xtext, and Medini QVT. In the process, a library for emulation of COBOL data types in C++ was also developed. Finally, we assess the success of the developed approach and prototype.

Keywords: source-to-source compiler, transpiler, translator, bidirectional transformation, model-driven engineering, COBOL, data type emulation.

Acknowledgements

We would like to thank our supervisor Regina Hebig for steering us in the right direction every time we lost track or had questions. We would also like to thank everyone else who helped us improve our work, including our examiner Jan-Philipp Steghöfer, and our opponents Peter Eliasson and Jakob Csörgei Gustavsson. Kindly, thank you.

Joel Severin and Sebastian Blomberg, Gothenburg, June 2017

Contents

1 Introduction
  1.1 Motivation in Relation to Industry
  1.2 Research Questions
  1.3 Purpose
  1.4 Delimitations and Limitations
  1.5 Disposition
2 Theory
  2.1 Formal Grammars
  2.2 Metamodeling
  2.3 Transformations
  2.4 Bidirectional Transformation Languages
    2.4.1 Triple Graph Grammars
    2.4.2 QVT Relational
    2.4.3 Text-based approaches
3 Related Work
  3.1 Bridging Grammars and Metamodels
    3.1.1 Xtext
  3.2 Translation of General-Purpose Languages
    3.2.1 Early Attempts at Bidirectional Translation
    3.2.2 The Idea of Concepts Common to Code between Languages
    3.2.3 Formalizing Bidirectional Translation
4 Method
  4.1 Survey of COBOL Language Construct Frequency
  4.2 Choosing a Target General-Purpose Language
  4.3 Emulating COBOL Data Types in C++
  4.4 Creating a Model-Driven Source-to-Source Compiler
    4.4.1 Creation of an Xtext COBOL Grammar
    4.4.2 Creation of an Intermediate Model
    4.4.3 Specifying Transformations in QVT-R using Echo
    4.4.4 Specifying Transformations in Medini QVT
    4.4.5 Creation of an Xtext C++ Grammar
    4.4.6 Evaluating Results
5 Survey of COBOL Language Construct Frequency
  5.1 The Developed Analysis Tool
  5.2 Analysis Results
  5.3 Limitations and Validity of Generalization
6 Choosing a Target General-Purpose Language
  6.1 The Comparison
    6.1.1 Static or Dynamic Typing
    6.1.2 Memory Management and Environment
    6.1.3 Primitive Types
    6.1.4 Classes and Objects
    6.1.5 Functions
    6.1.6 Basic Syntax
    6.1.7 Pre-Processing and Meta-Programming
  6.2 Discussion
7 Emulating COBOL Data Types in C++
  7.1 Enhanced Byte Arrays
  7.2 The Default Decimal Type
  7.3 Packed Decimal
  7.4 Strings
  7.5 Summary
8 Creating a Model-Driven Source-to-Source Compiler
  8.1 Transformations between Code and Model
  8.2 Information in Code
    8.2.1 Concepts
  8.3 The Intermediate Concept Model
    8.3.1 Program
    8.3.2 Variables
    8.3.3 Arithmetic Expression-Assignments
    8.3.4 Conditional Branching
    8.3.5 Loops
    8.3.6 Printing
  8.4 Transformation between Concept Model and Language Models
    8.4.1 Basic Relations
    8.4.2 Delegated Relations
    8.4.3 Enforcing Order
    8.4.4 Dealing With Strings in Medini QVT
    8.4.5 Implementing Alternate Relations
    8.4.6 Conditional Relations
  8.5 Evaluation
    8.5.1 Correctness
    8.5.2 Intent Preservation
    8.5.3 Construct Preservation
  8.6 Discussion
9 Conclusion
  9.1 Limitations
  9.2 Future Work
    9.2.1 On the Language Construct Analysis
    9.2.2 On Emulating COBOL Data Types
    9.2.3 On the Source-to-Source Compiler
      9.2.3.1 Choosing a Good Solution
      9.2.3.2 Eliminating the Intermediate Model
      9.2.3.3 Translation within a Language
      9.2.3.4 Usages beyond Translation
Bibliography

1 Introduction

A natural part of the technological evolution is the eventual decrease in popularity, and usage, of old technologies. In the field of Software Engineering, one such technology is the general-purpose programming language COBOL (common business-oriented language). Its replacements include more popular languages, such as Java, C, C++, C#, Ruby, Python, and recently even JavaScript. Our work aims to ease the transition between COBOL and a replacement language, by transforming old COBOL code to a new language, and then automatically propagating changes made in the new code back to the old COBOL code base. This advancement allows developers with knowledge of only one of the languages to work together, as development in both languages can co-exist. Hopefully, important system and domain knowledge can be transferred between programmers of different generations in the process.

1.1 Motivation in Relation to Industry

Conducting research about transforming COBOL code into a more modern language may be motivated by current circumstances in the industry, including:

• there has been a vast decrease in education of new COBOL programmers [1],
• existing COBOL programmers are nearing retirement age [2],
• a majority of IT managers raise concerns about a current or future shortage of COBOL-knowledgeable developers [2],
• COBOL is not a popular language anymore [3, 4],
• COBOL is still widely used, especially in the financial sector [5].¹

¹ Datamonitor [5] estimates that there were 200 billion lines of COBOL code in use in 2009, and 5 billion new lines being added each year. Additionally, they estimated that 90% of the world's financial transactions were processed in COBOL.

There are examples of products which transform COBOL into C [6] and COBOL into Java [7].
However, these tools only support migration of COBOL code, meaning that the transformation is unidirectional, and once the transformation is done, code is only written in the new language. As a consequence, the underlying system (often a mainframe) may need to be replaced. This presents a challenge for several reasons. First, it may be costly [2]. Second, it may present uncertainty, as there sometimes seems to be a feeling that the old systems, as quoted by a COBOL application manager, "work pretty well" [2]. Third, when existing COBOL developers retire, knowledge is lost. Finally, there might be serious performance issues when migrating to a new system [8]. By utilizing bidirectional translation of code, a COBOL system can remain intact and continue to be developed in C++, thus mitigating some of the factors described above.

Replacing the COBOL language might, however, not be a solution to all problems associated with legacy COBOL systems. While COBOL presents a language barrier, the systems themselves might have a complex design, and the COBOL-knowledgeable workforce not only knows COBOL, but also the domains of these systems.

1.2 Research Questions

The following research questions will guide the thesis work:

RQ1 Which are the most common COBOL constructs used in practice?

RQ2 Which well-known language would be appropriate to use as target, in a transformation from COBOL?

RQ3 How can a bidirectional transformation be applied in order to transform between COBOL (source) and the well-known language (target), such that:
  (a) the resulting code is correct according to the target syntax,
  (b) at run-time, both programs behave the same in terms of output, given the same input,
  (c) changes in the target code base propagate back to the source code base, such that as much of the COBOL code as possible is left intact, with focus on locality of change,
  (d) and such that language constructs in the source language are translated in a manner such that intent, to the greatest extent, is preserved in the corresponding target language constructs?

where we define intent in Definition 1.2.1. RQ1 and RQ2 need to be answered in order to 1) determine the focus and limitations of our study, and 2) design the actual transformations in the main research question RQ3.

Definition 1.2.1 (Intent). The implicit encoded reason why a piece of code is expressed in one way, when there are multiple ways to achieve the same goal.

1.3 Purpose

The purpose of our work is to provide an approach for implementing a bidirectional source-to-source compiler using modern transformation techniques, including (but not limited to) techniques from the field of Model-Driven Engineering (MDE). Furthermore, a prototype implementation will be presented as a proof of concept for the provided approach.

1.4 Delimitations and Limitations

One part of the outcome of the work we report on is a research prototype. As such, it is neither complete nor fully tested. However, it serves as a proof-of-concept that can be extended and generalized as the field of bidirectional language translation evolves. The prototype does not take code style into account. Instead, focus is put on the actual content of the code and approaches to translate it.
Furthermore, although the approaches presented in this thesis are quite general, we make no claim of generality for programming languages dissimilar to the ones supported in the prototype.

A comment on the usefulness of our work should be made. It could be that learning COBOL is a relatively small task, compared to understanding the complex systems written in COBOL. It could be theorized that most of the COBOL systems still in existence today are the most complex ones, since the simpler ones have already been replaced by modern technologies. Our work will not solve the issue of complex system knowledge, but it could still be useful for a human wanting to acquire such knowledge.

1.5 Disposition

The remainder of this thesis follows the structure of first presenting background theory, and then related work of direct relevance to our contribution. After that, our methods of conducting work are presented. Next, our contribution is reported on and discussed in chapters 5 to 8. Finally, a conclusion and some ideas for future research are presented. In the part concerning our contribution, we iteratively report on how we solved several problems across several chapters (this constitutes our results), enabling us to reach our goal of answering all research questions.

2 Theory

This chapter introduces the foundations on which the related work (next chapter) of direct relation to our contribution, and our contribution itself, are built. Since our work covers multiple sub-fields of software engineering, a short summary with motivations follows to orient the reader. First, formal grammars are described, which can be used to parse code in text form and generate a model from it. As a specification, formal grammars also allow text to be generated from the grammar rules they encode, possibly from an existing model. Second, the notion of a model is formalized, with the introduction of metamodels. The mappings between text and model are essential to our contribution and are further elaborated on in the Related Work chapter. Finally, transformations, especially those moving information between different models (as opposed to text), are described. The important property of a transformation being bidirectional (similar to reversible) is explained in depth.

2.1 Formal Grammars

A formal grammar (henceforth referred to simply as a grammar) contains a set of production rules, linking a list of nonterminal and terminal symbols to another list of such symbols [9]. The terminal symbols represent the elementary symbols of a language, while the nonterminals contain a list of terminals and other nonterminals. Also, the grammar contains a start rule, imposing some ordering on the application of the production rules. Given a specification of a grammar, a certain language can be parsed by a machine. Certain constraints can be applied to the grammar, restricting which types of languages it accepts. Chomsky [9] proposed classifying grammars using a hierarchy of constraints, limiting the languages they accept. The hierarchy orders four different classes of grammars, presented from general at the top, to less general at the bottom, in Figure 2.1. A certain grammar class is able to accept all languages that the less general grammar classes accept, and more. At the bottom of the hierarchy, regular languages are found, e.g. those parseable by a regular expression. Regular languages allow linking one nonterminal to exactly one terminal, optionally either preceded or followed by a nonterminal.
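As a concrete sketch (our own illustration, not an example from the thesis), a right-linear regular grammar generating the strings b, ab, aab, ... consists of rules of exactly these shapes:

    % Illustrative regular (right-linear) grammar for the language a*b.
    % Each rule links one nonterminal to exactly one terminal,
    % optionally followed by a nonterminal.
    \begin{align*}
      A &\rightarrow a\,A && \text{(one terminal, followed by a nonterminal)}\\
      A &\rightarrow b    && \text{(exactly one terminal)}
    \end{align*}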
One level up, the context-free languages are found. These allow linking one nonterminal to a list of terminals and nonterminals. At the next level, context-sensitive grammars reside, allowing a context of terminals and nonterminals around both sides of a rule (the context has to be the same on both sides of the rule). Finally, the most general class contains the recursively enumerable grammars, allowing any terminals and nonterminals on both sides of a rule, thus removing the restriction that the context must be the same, as in context-sensitive grammars.

Backus [10] proposed to standardize specification of production rules linking nonterminals to a list of terminals and nonterminals, allowing context-free grammars to be specified by a textual notation now known as the Backus-Naur Form (BNF). Grammars for general-purpose languages are typically expressed in the de-facto industry standard Extended BNF (EBNF) syntax, now standardized as ISO/IEC 14977 [11]. EBNF is an extension of BNF that adds syntactic sugar allowing easier specification of repetition (e.g. the Kleene star *, as also found in regular expressions).

Figure 2.1: Chomsky's hierarchy of grammars, ordered by the languages their respective class accepts. Anything is a terminal or nonterminal. LeftContext and RightContext are terminals or nonterminals, but need to be exactly the same on both sides of the arrow (to define a context).

In order to facilitate parsing of context-free languages by algorithms run on today's computers, Knuth [12] proposed that a suitable subset would be LR grammars, denoting grammars that can parse an input stream by reading it left to right once (without backing up to reconsider previous choices). The algorithms would then exhibit linear complexity with respect to the input length. More specifically, an LR(k) grammar is allowed to have k tokens of lookahead, meaning that the parser can peek at no more than k upcoming lexical tokens in the input stream. LR parsers have later become known for their other properties: they produce a rightmost derivation, resolving production rules from right to left, and thus produce a bottom-up parse (where the smallest elements of a language are recognized first) [13].

Some additional constraints imposed on the special case LR(1), also known as Canonical LR (CLR), have allowed the less resource-intensive Look-Ahead LR (LALR) and Simple LR (SLR) algorithms to be developed [13]. In fact, these theoretically cover all countable k, as it can be proven that for every LR(k) grammar, there exists an equivalent LR(1) grammar. Additionally, the Generalized LR (GLR) parser builds upon LR parsing, making it less restrictive (but with the same complexity) [14]. These specializations of the LR parser have become popular choices for implementation in compilers, and consequently even language specifications are written with respect to them. For example, the popular parser generator yacc, and its successor Bison, are based on these algorithms [15].

In contrast to LR parsing, a different family of algorithms, which have become known as LL parsers, has been discovered [13]. These still read the input stream from left to right, but perform a leftmost derivation, resolving rules from left to right, thus producing a top-down parse (where the largest elements of a language are recognized first).
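To illustrate the difference in derivation order, consider the toy grammar S → A B, A → a, B → b and the input ab (our sketch, not an example from the thesis); an LR parser produces the rightmost derivation in reverse:

    % Leftmost (LL, top-down) vs. rightmost (LR, bottom-up) derivation
    % of "ab" for the toy grammar S -> A B, A -> a, B -> b.
    \begin{align*}
      \text{leftmost:}  && S \Rightarrow A\,B \Rightarrow a\,B \Rightarrow a\,b\\
      \text{rightmost:} && S \Rightarrow A\,B \Rightarrow A\,b \Rightarrow a\,b
    \end{align*}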
For example, the popular parser generator ANTLR uses LL(*) parsing (from version 4, ALL(*) parsing), where the star denotes that k may be infinite [16].

For completeness, we mention that parsers are often table-based, meaning that they are built using pre-computed decision tables for which symbol is read, and what is in the look-ahead [13]. There also exist recursive-descent LL parsers, and recursive-ascent LR parsers, where the decisions are taken by actual program code, written natively in the programming language together with the parser. Lexers and parsers are typically generated by lexer and parser generators, i.e. tools which, based on a grammar, are able to automatically generate actual parser programs in native source code (C, Java, etc.).

Code of a certain programming language is typically read by a compiler using a two-step process of first applying lexical analysis to the input text stream using a regular language, then parsing the resulting tokens with a restricted context-free grammar (e.g. with a LALR parser) [17]. The lexer tokens provide a symbol representation of the underlying text stream, consisting of a name and an optional attribute (for example (INT, 42), when lexing the integer 42). The token names are then interpreted as terminals for the next stage, where the parser uses the context-free grammar to produce an abstract syntax tree (AST), for further use by the compiler.

2.2 Metamodeling

The code-centric development approach, i.e. defining programs by writing code, is the prevalent and traditional way of developing software. One explanation for why this is the case might be that the majority of all general-purpose programming languages are defined by grammars, meaning that somewhere along the way code must be written to produce software.

Another approach to describe and develop software is using models. In the field of software engineering, Seidewitz [18] defines a model as "a set of statements about some system under study (SUS)", where a statement is "some expression about the SUS that can be considered true or false". This definition is rather general, and for a model to be meaningful there needs to exist an interpretation. Seidewitz [18] defines an interpretation as a "mapping of the model's elements to elements of the SUS such that we can determine the truth value of statements in the model from the SUS". For example, suppose we have an Entity-Relationship (ER) model. The model contains expressions of which data to be stored and relationships between collections of these data expressions (entities). If we say that our SUS is a relational database, the interpretation of the model is that an entity is mapped to a table, an attribute to a column, a relationship to a foreign key constraint, and so on. In this case, the statements of the model are true if their mapped counterparts in the SUS exist and conform to the model. If all statements in a model are true, Seidewitz [18] defines the model as being correct.

A model can be expressed through the use of a modeling language. For example, in order to express an ER model, there needs to be a definition of how and what it can contain (e.g. entities, attributes, etc.), as well as how the different components relate; this is defined in a modeling language. The Unified Modeling Language (UML), containing language definitions for a wide range of applications, such as Class diagrams (models) and State machines, is the predominant modeling language for software.
A modeling language is itself defined by another model, referred to as a metamodel. A metamodel expresses statements which any model created in a modeling language needs to adhere to. As such, the interpretation of a metamodel is a mapping between its elements and the elements of the modeling language, and therefore a metamodel determines the validity of a model created in a modeling language [18].

An analogy can be made between a metamodel and a grammar. A grammar is a textual specification (a set of rules) of how a piece of code (text) may be written to be valid. Similarly, a metamodel is a specification of how a model, created in a certain modeling language (a set of rules), may be created to be valid.

Since a metamodel is actually a model, it can be expressed in a modeling language, implying that a metamodel can be defined by a meta-metamodel. In turn, that meta-metamodel can be defined by a meta-meta-metamodel. Theoretically, there is no limit to how many metamodels can exist in a chain of metamodels. However, at some point they are bound to be superfluous, specifying trivialities. In such a case, a metamodel can be used to describe itself. Within model-driven software engineering, there is a four-layer architecture traditionally used when referring to layers of metamodels, as depicted in Figure 2.2 [19]. At the top layer M3, there is a general (meta)model specified in the Meta-Object Facility (MOF) standard [20]. This M3 model is self-contained, meaning that it can be used to describe itself. An M3 model can also be used to describe an M2 model. The most commonly used M2 model is the UML metamodel, although the MOF standard does not limit the choice to UML. Layer M1 classifies user-defined models, conforming to the metamodel in layer M2. For example, a layer M1 user-defined UML model could be a model for modeling users in a system. Continuing the example, the corresponding layer M0 model would be the instance of the M1 (meta)model, i.e. the data which describes the actual user. See the right-hand side of Figure 2.2 for an illustrative example.

Figure 2.2: An illustration of the four-layer architecture. The left-hand side of the figure illustrates the relations between the models of the four layers, whereas the right-hand side provides an example of usage.

2.3 Transformations

Transformation is a mechanism for establishing consistency between, or creating new, artifacts by means of a relation, where an artifact might be, for example, a text, a model, or any other kind of formally defined information. Transformations are used in a variety of areas including Software Engineering, Programming Languages, and Databases. A simplistic definition of a unidirectional transformation is r : S → T, where r is the function which specifies the relation between S, the set of source artifacts, and T, the set of target artifacts [21]. By applying the transformation, a target t that complies with the source s, according to the relation r, can be produced. Consistency between a source and a target is considered true if and only if t = r(s).

Note that in grammar-related literature, such as [22], a transformation may be referred to as a translation when it is performed on text. In general, it is intuitive to think of a translation as a transformation, since a translation transforms text, although context-dependent meanings might apply in some cases.
We will use translation and transformation interchangeably when referring to the end-to-end process of translating code; note that a translation may consist of several transformations that are not text-based, as long as the outcome is text.

Apart from unidirectional transformations, there are also bidirectional transformations, where a relation between two (or more) artifacts applies in both directions. Borrowing from Stevens [21], a bidirectional transformation can be defined by a statement of consistency R(s, t), which is deemed true if the source s and the target t are consistent with regard to the relations R. For each relation in R, there are two directional transformations, one in each direction: the forward transformation →R : S × T → T and the backward transformation ←R : S × T → S; that is, given a pair of models s and t, the transformation should be able to restore consistency in either direction according to the relations R. It should be noted that in a truly bidirectional transformation, there is no distinction between which artifact is the source and which is the target. However, in literature, the terms are still used when describing bidirectional transformations.

Bidirectional transformations are inherently more complex than unidirectional transformations. One of the main issues faced with bidirectional transformations is dealing with non-trivial types of relations. A relation between two artifacts in S and T, respectively, is bijective if for every artifact s ∈ S there exists exactly one unique artifact t ∈ T, i.e. there is a one-to-one mapping between artifacts (see Figure 2.3a). Bijective relations are easy to handle, in the sense that there is always an obvious way to calculate the inverse of a bijective transformation, i.e. the transformation in the other direction. However, a bijective constraint may be too restrictive in many cases [21]. Surjective relations, on the other hand, allow for more flexibility. A relation is surjective if for every s ∈ S there exists an artifact t ∈ T (not necessarily unique), i.e. there exist many-to-one mappings (see Figure 2.3b). However, there is no definite way of calculating the inverse of a surjective transformation, since the inverse of a many-to-one transformation is one-to-many, where any of the options is valid. We refer to this as the general case, see Figure 2.3c. Henceforth, we refer to relations of a surjective or general-case nature as non-bijective relations.

Figure 2.3: An illustration of the types of relations in transformations. (a) shows a bijective relation, which is invertible. (b) shows a surjective relation S → T, which turns into the general case (c) in the opposite direction T → S.

Although applicable in more places, non-bijective relations present problems when restoring consistency: for example, which of the alternatives in a one-to-many mapping should be used? This is a matter of resolving ambiguities. In some cases it might not matter, since all are valid solutions, but in other cases it does. Consider, for example, if s2 has been transformed into t2 in the situation in Figure 2.3b. There are then two possible ways to transform t2 into an artifact in S, namely either s2 or s3, as illustrated in Figure 2.3c. In some cases, e.g. if no changes were made to t2, it might be important that s2 is chosen, since s2 was the original artifact before transformation.
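This situation can be restated formally (a worked micro-example of our own, with hypothetical artifact sets, mirroring Figure 2.3):

    % Our worked restatement of the ambiguity in Figure 2.3: two sources
    % relate to the same target, so only the backward direction is ambiguous.
    \begin{align*}
      S &= \{s_2, s_3\}, \quad T = \{t_2\}, \quad
          R = \{(s_2, t_2),\, (s_3, t_2)\}\\
      \overrightarrow{R}(s_2, t) &= t_2, \quad \overrightarrow{R}(s_3, t) = t_2
          && \text{(forward: unambiguous)}\\
      \overleftarrow{R}(s, t_2) &\in \{s_2, s_3\}
          && \text{(backward: ambiguous)}
    \end{align*}

A tool that keeps the old source as input can resolve the ambiguity by choosing the artifact closest to the original (here s2), whereas a tool that regenerates artifacts from scratch must pick an alternative by some other policy; this distinction reappears below as model synchronization.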
2.4 Bidirectional Transformation Languages

There are several approaches towards implementing a bidirectional transformation. Perhaps the most natural method that comes to mind is to use a pair of unidirectional transformations, such that r1 : S → T and r2 : T → S, forming a relation r. The problem with this approach is that consistency is not guaranteed. For example, consider r1 to be a valid transformation. There is no guarantee that r2 is compliant with r1 and produces the intended result according to the relation r, since it is specified separately from r1. Poskitt et al. [23] propose a formal way of addressing this issue by automatically translating a pair of unidirectional translations into graph transformations, i.e. states of computations presented as graphs, and nested conditions, in order to apply graph transformation validation techniques to ensure consistency. However, their work is not yet available as a tool for public use.

Another approach for implementing bidirectional transformations is using a transformation language which supports bidirectional transformations. Such languages are able to express bidirectional transformations as one unit, i.e. they are able to express the forward and the backward transformation simultaneously. The benefit of this is that transformation validity is ensured by construction [24]. There is a variety of different bidirectional transformation languages and tools operating on different artifacts (e.g. models, text). Their methods of transformation also vary, which may give substantially different results between implementations. The remainder of this section is used to describe the most common of these transformation methods, as well as a few bidirectional languages which use them. In order to compare bidirectional transformation languages, we differentiate between two types: those which are able to support model synchronization and those which are only able to generate new models. Supporting model synchronization entails that information from an old version of a model (artifact) can be used to recreate, i.e. synchronize, a new version, meaning that potentially less information is lost.

2.4.1 Triple Graph Grammars

One of the common methods for implementing bidirectional languages is Triple Graph Grammars (TGGs), as introduced by Schürr in 1994 [25]. TGGs use graph grammars, whereby the artifacts to be transformed (source and target) are interpreted as graphs, along with an intermediate correspondence graph which links the source and the target (i.e. defines the relations). Through the use of these graphs, a language implementing TGGs can then generate the forward and backward transformations, as well as check consistency.

Three bidirectional languages, all based on TGGs and operating on EMF-based models (as artifacts), are eMoflon [26], MoTE [27], and TGG Interpreter [28]. Their specification methods are similar and follow a general TGG rule format. Furthermore, they all support model synchronization. However, their implementations differ. MoTE requires all TGG specifications to be conflict-free [29], meaning that all relations are required to be bijective. TGG Interpreter, on the other hand, does not require specifications to be conflict-free, nor does it guarantee that a valid (or complete) model will be produced, as it does not implement any lookahead or backtracking; as such, if one wrong decision is taken during the transformation, the transformation will continue down the erroneous path. eMoflon, like TGG Interpreter, does not require specifications to be conflict-free either.
Unlike TGG Interpreter, however, eMoflon implements local look-ahead, guaranteeing that a valid model, but not which valid model, is produced. In a comparison of TGG tools, Greenyer [29] concludes that MoTE should be used in situations where conflict-free specifications suffice, due to its efficiency. eMoflon should be used in situations which require look-ahead/backtracking, but it then comes with an overhead that affects performance. The benefit of TGG Interpreter is preservation of information during transformations.

2.4.2 QVT Relational

Unlike Triple Graph Grammars, QVT Relational (QVT-R) is not a method for implementing bidirectional transformation, but rather a language specification dictating how transformations should be specified and the intended outcome of transformations; how the actual transformations are implemented is not exactly specified. QVT-R is part of the MOF-based QVT (Query/View/Transformation) specification [30], published by the Object Management Group. QVT-R takes a similar high-level view of defining transformations as TGGs, and Stevens [31] notes that the QVT core language definition is clearly influenced by TGGs.

According to Stevens [31], the QVT-R standard is somewhat ambivalent towards allowing non-bijective transformations. She argues that the standard can be misinterpreted and does not make explicit that non-bijective transformations are allowed. However, she does show that non-bijective transformations are possible to implement, according to the standard, although the expressiveness of those has its limitations.

There are several tools, with different underlying implementations, that implement the QVT-R specification. Medini QVT [32] is perhaps the most mature one. Operating on EMF-based models, Medini QVT can be run as a standalone Java program or as an Eclipse plugin (with debug support). Medini QVT does not support model synchronization, meaning that a new target is generated from scratch during every transformation. As such, transformations in Medini QVT can be denoted →R : S → T and ←R : T → S, rather than by the more general definition presented in Section 2.3, which takes both domains S and T as input parameters.

Another tool, based on EMF models, which implements the QVT-R specification, is Echo [33]. Echo implements transformations by translating models and relations into Alloy [34], which contains a model finder and satisfiability (SAT) problem solver. A SAT problem is set up for Alloy to solve, with parameters to ensure the principle of least change, meaning that a transformed model will be as close to its original as possible. Therefore, Echo supports model synchronization through its least-change approach. Notably, Echo is able to produce several solutions, each further away from the least possible change. Furthermore, Echo also takes OCL (Object Constraint Language) constraints into account when finding appropriate models, meaning that the outcome will always be valid with regard to its metamodel. In comparison, Medini QVT does not take OCL constraints into account, meaning that it may produce invalid models. However, using a SAT solver naturally results in a larger performance overhead.

The Eclipse Foundation is working on its own implementation of QVT-R, included in Eclipse QVTd [35]. While a rudimentary implementation is already released, the full version is planned to be released in July 2017.
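The practical difference between regenerating a target and synchronizing one can be sketched in code. The following C++ fragment is our own illustration; the types and function names are hypothetical and not part of any QVT tool's API:

    #include <string>

    // Hypothetical toy "models": a source and a target element.
    struct Source { std::string name; };
    struct Target { std::string name; std::string note; /* target-only data */ };

    // Medini-QVT-style forward transformation: the target is rebuilt from
    // scratch, so target-only data (note) cannot be preserved.
    Target regenerate(const Source& s) {
        return Target{s.name, ""};  // note is lost on every run
    }

    // Synchronizing forward transformation in the style of Section 2.3
    // (forward R : S x T -> T): the old target is an input, so consistency
    // can be restored while target-only data survives.
    Target synchronize(const Source& s, const Target& oldTarget) {
        Target t = oldTarget;  // start from the existing target (least change)
        t.name = s.name;       // re-establish consistency with the source
        return t;              // note is kept
    }

    int main() {
        Source s{"Account"};
        Target old{"Acount", "reviewed 2017-05-01"};  // stale name, extra note
        Target t1 = regenerate(s);       // t1.note == "" (information lost)
        Target t2 = synchronize(s, old); // t2.note == "reviewed 2017-05-01"
        return t1.name == t2.name ? 0 : 1;
    }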
Another tool which does not implement the QVT-R standard, but has a syntax very similar to that of QVT-R and works in a similar manner, is the Janus Transformation Language (JTL) [36]. JTL works on EMF-based models and addresses change propagation by means of Answer Set Programming; when a change is detected in the source, a constraint (SAT) solver is used to find the ideal solution for a consistent target. If there is more than one solution, JTL is able to return all of them. Principally, JTL is comparable to Echo, although their underlying implementations differ.

2.4.3 Text-based approaches

The approaches presented so far have all been based on models, in the sense that they operate on models defined by a metamodel. There is also a set of bidirectional transformation tools which operate on text-based artifacts, e.g. XML (Extensible Markup Language) documents. As noted by Stevens [21], it might seem that model-based transformations should be capable of handling XML documents, since models are defined in XMI (XML Metadata Interchange) files and both formats share many properties. However, there are differences (e.g. handling of identifiers) which make these approaches incompatible.

biXid [37] is an XML-to-XML bidirectional transformation language, based on a programming-by-relation approach of specifying transformations (similar to that of QVT-R). biXid supports non-bijective relations, but will choose a solution non-deterministically in cases where ambiguity exists.

XSugar [38] is another text-based bidirectional transformation language, able to transform between XML and a self-defined text format. XSugar requires a context-free grammar to be specified for the text format to transform to. Given such a grammar, XSugar is able to map elements of an XML specification to those of the grammar, thereby bidirectionally transforming text between the two formats. The limitation of this approach is a lack of support for non-bijective relations.

Another text-based bidirectional transformation language is Boomerang [39]. In comparison to XSugar, Boomerang is generally applicable to self-defined text formats and includes a rich set of features for writing transformations. Transformations in Boomerang are written as lenses. A lens is a bidirectional program operating between two domains S and T. A lens requires that T is an abstraction of S, meaning that all information in T must have corresponding representations in S (but not the other way around). A lens mapping between the two domains consists of the following operations [39]:

• Get : S → T
• Put : T → S → S
• Create : T → S

which have to obey the following laws:

1. Put (Get s) s = s, meaning that any information lost when transforming an artifact s to an abstract artifact t must be restored when transforming back,
2. Get (Put t s) = t, and
3. Get (Create t) = t,

where the latter two both mean that there should be no loss of information when transforming from the T domain to the S domain and back again. As such, all transformations obeying the laws of lenses are well-behaved, although this comes at the cost that the T domain must be an abstraction of S. In Boomerang, a lens can consist of a composition of other lenses. Boomerang provides a basic set of lenses for transforming strings. Based on those, user-defined lenses can be written in an approach that can be described as a mix of specifying grammar and relations.
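To make the lens laws concrete, here is a minimal sketch in C++ (our own illustration with hypothetical record types; real Boomerang lenses operate on strings): T abstracts S by dropping one field.

    #include <cassert>
    #include <string>

    // Hypothetical concrete domain S and abstract domain T (T drops the age).
    struct S { std::string name; int age; };
    struct T { std::string name; };

    T get(const S& s)             { return T{s.name}; }          // Get : S -> T
    S put(const T& t, const S& s) { return S{t.name, s.age}; }   // Put : T -> S -> S
    S create(const T& t)          { return S{t.name, 0}; }       // Create : T -> S

    int main() {
        S s{"amount", 42};
        T t{"total"};
        // Law 1 (GetPut): putting back an unchanged view restores s exactly.
        S s1 = put(get(s), s);
        assert(s1.name == s.name && s1.age == s.age);
        // Law 2 (PutGet): information written through Put survives a Get.
        assert(get(put(t, s)).name == t.name);
        // Law 3 (CreateGet): likewise for Create, which needs no old source.
        assert(get(create(t)).name == t.name);
        return 0;
    }

Note how Put uses the old source s to restore the information (the age) that the abstract domain discards; this is exactly what makes the backward direction well-behaved.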
Furthermore, Boomerang supports non-bijective relations in the sense that when there is ambiguity (which only happens when transforming from T to S), the programmer can control how consistency should be restored.

3 Related Work

This chapter introduces some research of direct relevance to our contribution, building upon the foundations laid out in the previous Theory chapter. Two important research areas are summarized: how the technological spaces of grammars and metamodels can be bridged, and how bidirectional transformations between general-purpose languages can be performed.

3.1 Bridging Grammars and Metamodels

Grammars and metamodels are two different methods of expressing structure for information (code and models, respectively), and they belong to, and are treated in, two different technological spaces [40]: grammarware and modelware. Grammars tend to be used for most general-purpose programming languages. In particular, context-free grammars (e.g. BNF), along with plain English for semantics inexpressible in the grammar, are often used to specify programming language definitions. Correspondingly, metamodels are used to describe the semantics of models. While a programming language may theoretically be specified in models, as is the case with some Domain-Specific Languages (DSLs), it is not commonly the case for general-purpose programming languages. To fully utilize the strengths of both grammarware and modelware, it would be beneficial for information specified in one technological space to be available in the other.

Recalling Seidewitz's [18] definition of a model, it may seem trivial that a model can represent code, i.e. the SUS is actual programming code and the interpretation is that if there exists a language construct in the model, it will also exist correctly (in various different aspects) in the code. In a similar manner, a metamodel can be thought of as representing a grammar. While this might seem plausible in theory, the realization of such a bridge between the two technological spaces presents a handful of issues.

Alanen and Porres [41] studied the relationship between context-free grammars and MOF metamodels. In their work, they defined relations between BNF and MOF-defined metamodels in both directions. The basic relationship is one where a rule in BNF is converted to a class in the metamodel. They found that an arbitrary BNF grammar can be converted to a metamodel using an algorithm operating on the M2 layer; the resulting metamodel may not be very intelligible, depending on the grammar. In the opposite direction, however, they found that the set of metamodels that can be converted to BNF grammars is restricted. Metamodels intrinsically contain more information than BNF grammars, which presents the problem of maintaining that information when transforming back to a metamodel from a BNF grammar.

Inspired by the work of Alanen and Porres, Wimmer and Kramler [42] propose a theoretical framework for bridging the two technological spaces. Their work extends that of Alanen and Porres by defining relations in the M1 layer, meaning that programs and models can be transformed. Furthermore, Wimmer and Kramler try to enhance generated metamodels by a series of transformation rules which eliminate redundancy. They also introduce the concept of a change model, which can contain annotations that dictate properties of a metamodel derived from the grammar.
These annotations include specifying which attributes should be treated as references instead of compositions, what identifiers should be used for referencing, and data type specifications for attributes (e.g. string, int, boolean). By applying the change model, the generated metamodel will be richer in detail. Kunert [43] presents a framework similar to that of Wimmer and Kramler, although instead of having a change model, Kunert extends EBNF to support specification of annotations directly in the grammar. For example, terminals which should not be included in the abstract syntax tree, and therefore not in a metamodel generated from the grammar, can be marked for exclusion. Although the problem of maintaining information when converting from a metamodel to a BNF-based grammar and back still remains when using annotations, the set of supported metamodels becomes less restricted, since the grammar shares more information with MOF metamodels.

3.1.1 Xtext

While the frameworks by Wimmer and Kramler [42] as well as Kunert [43] are largely experimental, there is a more mature framework, included in the Eclipse project and called Xtext, which is able to bridge the grammarware and modelware technological spaces. Xtext is a framework intended for creating textual DSLs [44]. It shares similar properties with Kunert's framework, namely the ability to annotate grammars in order to describe the generated metamodel. One difference to Kunert's framework is that Xtext takes the approach of specifying what should be included, rather than what should be excluded. That is, Xtext's grammar syntax, which has a similar style to that of EBNF, is quite rich in details about what should be included and how entities should be represented (e.g. reference/containment, collection or singular objects, data types, names, etc.) in the generated metamodel. Figure 3.1 shows an example of Xtext's grammar syntax. Xtext is also able to generate a grammar based on an already existing metamodel, essentially taking the metamodel and treating it as a metamodel for the abstract syntax tree. Furthermore, Xtext includes support for model-to-text transformations in both directions, i.e. serialization and deserialization of a model, and is able to generate a textual editor for a language described by an Xtext grammar.

Table 3.1: A summary of the relationships between notations in EBNF and Xtext grammars, as found by Yue [47].

  Usage                 | EBNF                    | Xtext
  ----------------------+-------------------------+------------------------------------------
  Definition            | =                       | :
  Concatenation         | ,                       | whitespace
  Termination           | ;                       | ;
  Alternation           | |                       | |
  Optional              | [...]                   | (...)?
  Zero or more          | {...}*                  | (...)*
  One or more           | {...}+                  | (...)+
  Grouping              | (...)                   | (...)
  Exception             | -                       | !
  Terminal string       | "..."                   | '...' or "..."
  Enumeration           | name = "opt1" | "opt2"  | enum name: value1="opt1" | value2="opt2"
  Simple assignment     | A = B C                 | A: bAttr=B cAttr=C
  Collection assignment | A = B+                  | A: bs+=B+

Figure 3.1: An illustration of basic Xtext syntax. The figure shows a rule of type AClassType, with one attribute key and a collection alternativeKeys.

Although Xtext was initially intended for specifying DSLs [44], there are no theoretical limitations for using it to specify existing general-purpose programming languages. The limitations are rather those of the underlying parser generator. Xtext uses ANTLR 3 [45] in order to generate a parser which parses textual input (i.e. code). ANTLR 3 generates an LL(*) parser [46], as explained in Section 2.1.
One drawback of using LL(*) is the non-tolerance of left-recursive rules, a problem which needs to be addressed since existing general-purpose programming languages are quite often defined in BNF-based grammars, which allow left recursion. Another problem is addressing the lack of meta-information available in existing BNF-related grammars. Xtext, for example, requires names for all attributes, as well as specification of cardinality (i.e. singular or list), as exemplified by Figure 3.1.

Yue outlines the relationship between notations in EBNF and Xtext in [47]. Table 3.1 summarizes some of these relationships. Additionally, Yue [47] presents a set of strategies for dealing with patterns allowed in EBNF, such as left recursion and common factors (e.g. A = B | B C, where B is the common factor), but not in Xtext, due to the underlying parser generator.

Yue [47] does not suggest an automatic process for converting an EBNF grammar to Xtext format; nor does she deal with the lack of information (e.g. references, data types) which is required to generate a fully fledged Xtext grammar from the EBNF grammar. While there does not exist a fully automated solution producing a complete Xtext grammar today, there exist semi-automated ones. Bergmayr and Wimmer [48] present such an approach, where they define an Xtext grammar for EBNF itself, allowing an arbitrary EBNF grammar to be converted into a model. They then apply model-based transformation techniques in order to translate an EBNF model into an Xtext model (grammar), based on relationships similar to those presented in Table 3.1. It is unclear if they automatically refactor left-recursive rules, but their article mentions dealing with difficulties related to LL(*)-based parsing, of which the existence of left-recursive rules is a primary problem. Although they are able to transform many types of entities automatically, cross references still have to be handled manually. They deal with this in a similar manner as Kunert [43], by extending their EBNF grammar to allow annotating attributes as references. However, they deal with refinement of types for cross references in the Xtext grammar automatically, by example, meaning that given sample code containing cross references, a mechanism is able to infer more specific types for the cross references in the grammar.

All methods of bridging grammars and metamodels mentioned above will most likely require additional refinement to produce a useful metamodel. The reason for this is that most grammars are not intended to be used as metamodels, and therefore they contain less information than required to produce a useful metamodel. Therefore, going from the grammarware technological space to the modelware technological space may entail a certain overhead in terms of refining a generated metamodel.

3.2 Translation of General-Purpose Languages

The idea of translating between general-purpose computer languages arose during the 1950s-1960s, when code was highly dependent on the computer type it operated on, and therefore not very portable [49]. Early comprehensive attempts to solve this problem involved the construction of an intermediate universal language, UNCOL [50]; the idea being that every programming language had a translator which translated to UNCOL, and every computer had a translator which translated UNCOL into machine code. However, this idea was never fully realized [49], although the later-created LLVM is built on a similar idea [51].
In the remainder of this section, subsequent approaches to primarily bidirectional language transformation are presented.

3.2.1 Early Attempts at Bidirectional Translation

Early attempts at bidirectional source-to-source translation include the work of Albrecht, Krieg-Brückner, et al. [52], in which they translate between Ada and Pascal. Their translation consists of a chain of sub-translations, in the following manner (illustrated in Figure 3.2): Ada is translated to a subset of the Ada language, termed AdaP. AdaP is translated to an intermediate representation. The intermediate representation is translated to a subset of the Pascal language, termed PascalA (since PascalA is a sublanguage of Pascal, no transformation from PascalA to Pascal is required). The opposite order applies for translation in the other direction.

Figure 3.2: An illustration of Albrecht et al.'s [52] approach to bidirectionally translate between Ada and Pascal code. The arrows each represent a transformation. Note that AdaP ⊂ Ada and PascalA ⊂ Pascal.

The sub-languages AdaP and PascalA are defined by Albrecht et al. [52] to only include language constructs which have bijective relations to the corresponding language constructs of the other sub-language, i.e. such that there exists exactly one one-to-one mapping between a language construct and its respective counterpart. For example, assignment of data to a variable is possible in both languages and is done in a similar manner which, if translated, could have a bijective relation. Therefore, variable assignment is included in the sub-languages AdaP and PascalA. However, array assignment is only possible in Ada, and not in Pascal. Therefore, array assignment is not included in the sublanguages. An array assignment can, however, be expressed as multiple variable assignments, each accessing one element of the array. As such, array assignments in Ada can be translated by first decomposing an array assignment in Ada into multiple variable assignments in AdaP (see the sketch at the end of this subsection). AdaP is then translated in the rest of the translation chain. The purpose of the transformations from Ada and Pascal into their respective sublanguages is therefore to convert language constructs which have no direct one-to-one mapping to the other language into those which have.

In order to implement the bidirectional transformation between the two sublanguages, Albrecht et al. [52] use an intermediate representation to simplify the transformation. This intermediate representation is the unifying abstract syntax tree (AST) of both sublanguages, and contains only the necessary information to express programs in both sublanguages, i.e. unnecessary syntactic and logical information is removed. Three different transformations are used for each language (six in total), as depicted in Figure 3.2. It is worth noting that there are no transformations from a sublanguage to its full version, meaning that once a program has been translated, advanced language constructs will have been replaced. It should also be noted that Ada and Pascal are not fully compatible, and therefore Albrecht et al. [52] do not support translation of the full language specifications.
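Transposed into C++ terms (our analogy; Albrecht et al. worked on Ada and Pascal, not C++), the decomposition of an aggregate assignment into element-wise assignments looks like this:

    #include <array>
    #include <cstddef>

    int main() {
        std::array<int, 4> a{1, 2, 3, 4};
        std::array<int, 4> b{};

        // "Advanced" construct: whole-array assignment in one statement.
        b = a;

        // Decomposed equivalent using only element-wise variable assignment,
        // the kind of lowering Albrecht et al. apply when the target
        // sublanguage lacks aggregate assignment.
        for (std::size_t i = 0; i < a.size(); ++i) {
            b[i] = a[i];
        }
        return b[0] == a[0] ? 0 : 1;
    }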
3.2.2 The Idea of Concepts Common to Code between Languages

Krieg-Brückner, one of the authors of [52], later generalized the ideas of the Ada-to-Pascal translation method in [53]. Krieg-Brückner proposes that concepts more general than specific language constructs should be identified when translating code: first translating to a subset of the source language (compatible with a greatest-common-divisor intermediary), and then into the target language. As an example, an advanced Fortran IF-statement is translated into a basic variant using IF and GOTO, still in Fortran. The simple IF-and-GOTO construction is then translated into equivalent code in Pascal. Thus, the advanced IF-statement, which was supported only by Fortran, could be emulated by simpler code that had a direct counterpart in Pascal. Krieg-Brückner also proposes that certain concepts might be emulated in the target language (for example, exceptions in Ada have no direct counterpart in Pascal). A discussion follows about which abstraction level to choose: should all higher-level constructs be mapped to IF-and-GOTO constructions, and all types be mapped to a chunk of raw memory (e.g. a byte array)? Or should the higher-level concepts be kept as far as possible in the translation architecture? We will treat these questions in our results section, with some support by Wing. Wing [54] introduced the idea of computational thinking, describing how humans can be trained to think like a computer, abstracting problems with models and then algorithmically solving them.

3.2.3 Formalizing Bidirectional Translation

Building on the work of Albrecht et al. [52], Yellin [55] developed a method for translating general-purpose programming languages, based on invertible attribute grammars (AGs) and an intermediate language. AGs, a concept proposed by Knuth [56], are a formal way of expressing the semantics of a context-free language, by associating attributes with production rules of a context-free grammar. The value of these attributes can be set by assignment of other attributes or semantic functions, and when evaluated, an attribute grammar can, for example, be used in the translation of code.

Yellin and Mueckstein [57] proposed a method for automatically inverting attribute grammars that specify translations between two languages. Using this method, a translation (transformation) can be written from one language to another, and the translation in the other direction will be automatically generated, with validity ensured by construction. As such, this method shares many similarities with simpler bidirectional transformation languages (and perhaps it could even be considered as one). For an attribute grammar to be invertible, it has to be specified in what Yellin and Mueckstein [57] refer to as restricted inverse form, which limits its expressiveness (see [57] for more details). However, this does not seem to have a high impact when translating most general-purpose programming constructs.

With invertible attribute grammars as the basis for translation between languages, Yellin [55] proposed a general architecture for source-to-source translation between two or more programming languages, as illustrated in Figure 3.3.
3.2.3 Formalizing Bidirectional Translation

Building on the work of Albrecht et al. [52], Yellin [55] developed a method for translating general-purpose programming languages, based on invertible attribute grammars (AGs) and an intermediate language. An AG, a concept proposed by Knuth [56], is a formal way of expressing the semantics of a context-free language by associating attributes with the production rules of a context-free grammar. The values of these attributes can be set by assignment of other attributes or by semantic functions, and when evaluated, an attribute grammar can, for example, be used in the translation of code.

Yellin and Mueckstein [57] proposed a method for automatically inverting attribute grammars that specify translations between two languages. Using this method, a translation (transformation) can be written from one language to another, and the translation in the other direction will be automatically generated, with validity ensured by construct. As such, this method shares many similarities with simpler bidirectional transformation languages (and perhaps it could even be considered one). For an attribute grammar to be invertible, it has to be specified in what Yellin and Mueckstein [57] refer to as restricted inverse form, which limits its expressiveness (see [57] for more details). However, this does not seem to have a large impact when translating most general-purpose programming constructs.

With invertible attribute grammars as the basis for translation between languages, Yellin [55] proposed a general architecture for source-to-source translation between two or more programming languages, as illustrated in Figure 3.3. Every language construct in language A is translated to one language construct in the intermediate language I. Likewise, every language construct in language B is translated to one language construct in I. Every language construct in I may be translated into one of several language constructs in A and B respectively.

One benefit of having this intermediate language is that adding a new language to the translation process requires only one additional translation to be written, namely one from the new language to the intermediate language (e.g. TC in Figure 3.3), since the inverse is automatically generated (e.g. TC⁻¹). If translating between n languages, only n transformations would need to be written in total, compared to n(n-1)/2 (one for every pair of languages) had an intermediate language not been used.

Figure 3.3: An illustration of Yellin's [55] architecture for bidirectionally translating between multiple languages. This figure is based on the work of Yellin [55].

The intermediate language I is defined by Yellin as the canonical form, meaning: "1. Every program (in each source language) must be expressible as a program in the canonical form, and 2. every canonical form program that is the image of one translation must be expressible as the image of every other translation." [55]. Yellin then goes on to say that a preferable strategy for picking a canonical form is to select the greatest common divisor of all languages involved in the translation, meaning that every language construct that has corresponding construct(s) in all other source languages should be included. This way, as much of the program structure as possible is kept intact during a translation. Otherwise, since virtually all general-purpose languages are Turing complete, many language constructs could be written in the form of another (e.g. a loop could be written with GOTO statements, and thus only GOTO would be needed in the intermediate language), but this could possibly result in unreadable code after translation.

One benefit of using an intermediate language selected by greatest common divisor is that the transformation becomes more concept-centric, as described in Section 3.2.2. Compared to the approach taken by Albrecht et al. [52], consisting of two layers of translations, every language construct which does not exist in both languages can be mapped directly to a viable representation in the intermediate language. For example, in C, a numeric variable can be incremented in different ways, e.g. (i) i++, (ii) i += 1, (iii) i = i + 1. However, when translating to a language which only permits expression assignments (iii), the greatest common divisor is (iii). This scenario gives rise to a many-to-one mapping, and a one-to-one mapping, as illustrated in Figure 3.4. Albrecht et al. [52] would deal with this situation by first transforming alternatives (i) and (ii) to (iii) in a sublanguage of C. Their sublanguage representation is then translated to the other language. As such, Yellin's [55] approach requires fewer transformations. It does, however, require that any many-to-one mapping is managed when translating in the other direction. For example, which alternative (i), (ii), or (iii) should be chosen when transforming from the other language back to C? Yellin solves this by statically selecting one alternative as the only alternative and marking the other production rules in the attribute grammar as non-invertible (illustrated by the direction of the transformations of alternatives (i) and (ii) in Figure 3.4).

Figure 3.4: An example illustrating a translation scenario with multiple ways to express an increment statement in one language (left), and only one in another language (right). The GCD of these languages is i = i + 1 (middle).
Yellin [55] addresses an issue with defining the intermediate language based on the greatest common divisor, namely that if another source language with a lower level of abstraction is added, the entire intermediate language might need to be redefined. As such, he discusses another form, the least common multiple, which entails that all high-level constructs found in any of the source languages are included in the intermediate language. The burden, instead, is then to write translations which can describe every high-level construct not found in a language in terms of the low-level constructs that do exist. Furthermore, such an intermediate language may become incomprehensible and hard to grasp.

4 Method

As a general discipline, design research was used as a research methodology due to the specific circumstances of our work; specifically, solving the real-world problems identified in Section 1.1 based on theory and real-world context, and refining existing design disciplines. In order to answer all research questions, results were reached by iteratively solving smaller problems and combining them. The result chapters outline the main problems explored:

• how COBOL is used in practice,
• which popular language is suitable for translation together with COBOL,
• how such a translation can be performed bidirectionally,
• and what strategies are needed for parts not directly translatable.

Systematically, we followed the guidelines on design research defined by Hevner et al. [58] (refer to Table 1 in their paper):

Guideline 1: Design as an Artifact. Our work produced a viable artifact in the form of a working source-to-source compiler prototype.

Guideline 2: Problem Relevance. Our work is of relevance to the industry, as motivated in the Introduction chapter.

Guideline 3: Design Evaluation. The artifact was evaluated, as described later in this chapter, on its utility, quality, and efficacy.

Guideline 4: Research Contributions. We present a research contribution in the area of bidirectional transformations on general-purpose programming languages, using higher-level concepts (as presented as results in Chapter 8).

Guideline 5: Research Rigor. The results were reached and evaluated by scientifically applying the methods described later in this chapter.

Guideline 6: Design as a Search Process. We extensively used and combined available artifacts, in the form of already developed software tools. Where no suitable software artifact was available to solve a problem, we developed one.

Guideline 7: Communication of Research. This thesis serves to introduce and develop the field. While it is mostly technically oriented, there are some parts of relevance to management. Specifically, the importance and implications to the industry have already been described in the Introduction.

4.1 Survey of Cobol Language Construct Frequency

In order to survey how COBOL is used in practice, a tool was developed that can count occurrences, in actual program code, of the different language constructs available to COBOL programmers. The formal grammar that the open-source compiler GnuCOBOL [6] (formerly OpenCOBOL) uses when parsing programs for compilation was used as the baseline for what language constructs the tool can recognize. For analysis, the full source code of the open-source program Applewood Computing Accounting System [59] was used. The analysis ranked usage of language constructs, by occurrence, in the analyzed code base. The results are presented in chapter 5.
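As an indication of the kind of instrumentation involved, the following is a hypothetical C++ sketch of a counting hook of the sort one might invoke from parser actions; the actual tool modifies GnuCOBOL's C-based parser, and all names below are our own, not GnuCOBOL's:

    #include <cstdio>
    #include <map>
    #include <string>

    // Tally of how many times each recognized grammar rule was reduced.
    static std::map<std::string, unsigned long> construct_counts;

    // Called from a parser action, e.g. in the rule recognizing an IF statement.
    static void count_construct(const char* rule_name) {
        ++construct_counts[rule_name];
    }

    // Invoked once compilation finishes, producing data for later ranking.
    static void dump_construct_counts() {
        for (const auto& entry : construct_counts)
            std::fprintf(stderr, "%lu\t%s\n", entry.second, entry.first.c_str());
    }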
4.2 Choosing a Target General-Purpose Language

The task of choosing a language to translate COBOL to and from is many-faceted. First, a requirement has been that the language is popular and general-purpose in nature. Next, different languages exhibit different properties, making them compatible with COBOL to different degrees. This section describes how we chose the best candidate for our requirements. The results are presented in chapter 6.

First, a list of popular general-purpose languages was compiled, after which a comparison of these languages and COBOL was performed. In the comparison, we chose different criteria to classify the languages. Languages were eliminated from the comparison early on when it was clear they were not suitable for use together with COBOL. These criteria were chosen, according to our best judgment of relevance, to be the following:

• Static or dynamic typing
• Memory management (allocation on stack, heap, and garbage collection)
• Primitive types
• Classes and objects
• Functions
• Basic syntax (arithmetics, conditionals, branching, loops)
• Pre-processing and meta-programming (macros, templates, overloads)

4.3 Emulating COBOL Data Types in C++

As is perhaps already clear from the heading, C++ was ultimately chosen as an optimal target language. It was also established (in the results, see chapter 7) that some data types cannot be translated directly from COBOL to C++, as there is no direct counterpart. These types can, however, be emulated in code by user-defined data types. A proof-of-concept standard library, providing emulation for all COBOL data types not already supported by C++, was developed with the following goals:

• Bit-perfect storage of data (mirroring the COBOL format).
• Integer formats:
  – Plus operator overloading (showing arithmetic support).
  – Casting operator overloading between the emulated types and native types.
  – Arithmetic overflow callbacks.
• Strings:
  – Proper formatting of string data types, as COBOL formats them (arbitrary COBOL PICTUREs).
• Compound data definition:
  – COBOL RECORD emulation.
  – COBOL REDEFINES RECORD emulation.

Each method/overload on the developed types was tested when implemented, showing that the developed features work. The corresponding result chapter is chapter 7.
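To make the integer-format goals above concrete, consider the following minimal sketch (our own illustration, not the actual library, which is presented in chapter 7; __builtin_add_overflow is a GCC/Clang-specific builtin used here for brevity): an emulated integer with a plus operator, a cast to a native type, and an overflow callback.

    #include <cstdint>
    #include <functional>

    struct EmulatedInt {
        std::int32_t value;                        // stand-in for the COBOL storage format
        static std::function<void()> on_overflow;  // arithmetic overflow callback

        // Plus operator overload; invokes the callback on overflow and
        // returns the wrapped result.
        EmulatedInt operator+(const EmulatedInt& rhs) const {
            std::int32_t result;
            if (__builtin_add_overflow(value, rhs.value, &result) && on_overflow)
                on_overflow();
            return EmulatedInt{result};
        }

        // Casting operator overload to a native type.
        operator std::int32_t() const { return value; }
    };

    std::function<void()> EmulatedInt::on_overflow;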
4.4 Creating a Model-Driven Source-to-Source Compiler

The general method used for translation in the developed source-to-source compiler can be described in four steps: a text-to-model transformation from the source language, a model-to-model transformation to an intermediate model, a model-to-model transformation from the intermediate model, and a model-to-text transformation to the target language. This process is illustrated as an activity diagram in Figure 4.1.

Figure 4.1: An activity diagram showing the transformation flow, including each step involved in the translation process.

In order to implement the model-to-model transformations, the bidirectional transformation language QVT-R was used. The choice of using a bidirectional transformation language, compared to specifying pairs of unidirectional transformations, is motivated by gaining transformation validity by construct. This is a means to partly answer RQ3a and RQ3b. The motivation for choosing QVT-R over the other options (presented in Section 2.4) is threefold:

1. QVT-R is an established language standard, meaning that code written for it can be run in several tools (interpreters), including future tools which may become more mature over time. In contrast, tools implementing TGGs may have slightly different syntax and setup, and the text-based transformation languages all have different syntaxes.

2. There are tools available that implement model synchronization and handle non-bijective relations. These tools are almost exclusively built on top of Ecore, meaning that any models used will be compatible with other Ecore-based tools (for example, analysis and diagram tools to improve quality and understanding). In short, many opportunities in the MDE field become available.

3. QVT-R provides the option to specify OCL expressions for handling complex relations and conditions. To our understanding, text-based languages (such as Boomerang) and TGG-based tools are not as flexible in this respect.

Furthermore, a reason for not selecting a text-based transformation language is uncertainty about whether C++ and COBOL grammars can be fully specified in such a language. The remainder of this section presents the steps taken when implementing the prototype source-to-source compiler. Work was guided by small iterations, where each of these steps was to some extent included. The outcome of each iteration determined the decisions and scope of the next.

4.4.1 Creation of an Xtext COBOL Grammar

Since QVT-R is a model-based transformation language, COBOL and C++ code must be transformed to models in order to apply transformations. To bridge the grammarware and modelware technological spaces, Xtext was used. That is, an Xtext grammar for COBOL was defined, and Xtext was used to generate the corresponding metamodel, enabling text-to-model transformations for COBOL code.

The focus of our work is primarily on translation. As such, we chose to reuse and adapt an already existing grammar for COBOL. The grammar chosen [60] was created by Lämmel and Verhoef in [61] and reflects IBM's VS COBOL II Reference Summary. The grammar is specified in EBNF format and has been tested on over 2 million lines of COBOL code [61]. Lämmel and Verhoef's [61] COBOL grammar was converted to an Xtext grammar using Yue's [47] approaches and rules for conversion between EBNF and Xtext (see Table 3.1). In addition to these rules, one more rule was required: the symbol '||' in Lämmel and Verhoef's grammar denotes any permutation of two rules, which is represented as an unordered group, i.e. the '&' symbol, in an Xtext grammar (for example, a rule allowing 'a b' as well as 'b a' becomes a & b in Xtext).

4.4.2 Creation of an Intermediate Model

Like Yellin [55], we used an intermediate representation (model) to implement transformations between the two source languages. This intermediate model was defined to be the canonical form of the two languages. Unlike Yellin, however, we present our own approach, based on common concepts between programming languages, for defining the canonical form, in order to answer RQ3d. This approach, together with the resulting model, is presented as a result in Section 8.3.

4.4.3 Specifying Transformations in QVT-R using Echo

The QVT-R transformations between each of the source models were originally intended to be run by Echo. The motivation for this was threefold:

1. Echo provides support for model synchronization through its least-change principle. In practice, this would mean that any changes made to the target model would propagate back and result in a minimally changed source model. As such, RQ3c would, in theory, be answered, since the least-change principle would guarantee only local changes.
2. Echo's least-change principle would guarantee that any non-transformable information excluded from the transformation remains intact after a back-and-forth transformation.

3. Echo provides support for generating all possible solutions when synchronizing models, meaning that a user or an automated mechanism could pick the most appropriate solution.

Using Echo v0.3.1, Eclipse Modeling Tools v4.6.2, Eclipse QVTd v0.11, and Eclipse OCL v4.2, we wrote the initial transformations in QVT-R. However, we encountered numerous problems with Echo. First, we discovered that Echo does not support certain Ecore data types present in the metamodels generated by Xtext. Second, when running transformations, several scenarios produced Java exceptions, which required us to debug Echo's source code, and in some cases modify it. Finally, we could not make Echo obey containment rules in metamodels, i.e. Echo treated a containment (composition) as a regular reference. As a consequence, the solutions produced by Echo were invalid. The solutions produced could also become massive; that is, Echo could produce solutions containing classes that were not connected to anything else in the model. As such, when the model grew in size, the SAT problems solved by Echo grew even more in size (since they included all invalid solutions) and took too long to solve. As an example, synchronizing two simple models with 5 classes each, where each class had a small set of attributes and relations to other classes in the same model, could produce a SAT problem with over 2 million variables, taking over 30 minutes to solve. Therefore, we made the decision to abandon Echo.

4.4.4 Specifying Transformations in Medini QVT

Due to the problems faced with Echo, we used Medini QVT to run the QVT-R transformations instead. The transformations already written could be reused directly. Due to the limited developer literature for Medini, and for QVT-R in general, we faced several scenarios which were non-trivial to solve. We devised approaches for these scenarios, and together with the resulting transformation specifications, they are described as results in Section 8.4.

A drawback of using Medini QVT is that it does not support model synchronization, meaning that RQ3c will not be fulfilled when running the transformations in Medini.

4.4.5 Creation of an Xtext C++ Grammar

Similarly to the approach for defining an Xtext COBOL grammar, an Xtext grammar for C++ was defined. Lämmel and Verhoef's [61] COBOL grammar does not support object-oriented COBOL. As such, we limited our source-to-source compiler to only support the sequential part of C++ (i.e. the C-like part), with the exception of a few classes supplied by us to emulate COBOL data types. We found a C grammar defined by Terence Parr [62], implemented in ANTLR3, which was converted to an Xtext grammar. Unlike the COBOL grammar, this C grammar has not been rigorously tested. However, it sufficed to represent the sequential part of C++ which we required. Because this C grammar was implemented in ANTLR3, it was trivial to convert to Xtext (since Xtext is built on ANTLR3). Therefore, using this C grammar was deemed preferable to converting a C++ EBNF grammar to Xtext.

4.4.6 Evaluating Results

Due to the limited size of our prototype, we were able to perform manual tests on different combinations of language constructs to evaluate our approach.
Representative examples, based on code from the Applewood Computing Accounting System, were devised and used to evaluate mainly three different aspects:

1. Correctness - whether the translation is valid, in that a piece of translated code produces the same output as the original code. Deeply nested constructs are of particular interest, due to their non-trivial nature.

2. Intent preservation - how the intent of a certain piece of code is preserved after a translation.

3. Construct preservation - how well unchanged constructs are left intact after a back-and-forth translation, both including and excluding changes to the code base.

Note that unmodified extracts of code from Applewood Computing Accounting System [59] could not be used, due to the limited range of language constructs supported by our prototype. Instead, examples containing as much supported code as possible were gathered, and the unsupported language constructs were removed.

In evaluating and discussing the validity of transformations, we differentiated between two types of validity, transformation validity and specification validity, which are defined as follows:

Definition 4.4.1. Transformation validity is regarded as guaranteeing that transformations produce a valid result, in both directions, according to specification.

Definition 4.4.2. Specification validity is regarded as guaranteeing that relations are specified in a manner such that constructs in one language correctly relate to those in the other language, i.e. the meaning is correct.

Since transformation validity is largely guaranteed by construct when using a bidirectional transformation language, the main focus of the evaluation was specification validity.

5 Survey of Cobol Language Construct Frequency

This chapter presents the tool developed in order to analyze the occurrence frequency of COBOL language constructs, and the results obtained by running it against a sample project. Performing the analysis is of interest because the results can be used to scope the work in the remaining result chapters. The validity of generalizing the results is discussed at the end of this chapter.

5.1 The Developed Analysis Tool

In order to measure occurrences of certain COBOL language constructs in a given code base, the compiler GnuCOBOL [6] (formerly OpenCOBOL) was modified by adding counting code to the parser generator specification (a formal grammar and AST-construction code are already specified together for use by the parser generators yacc or Bison in the existing source). A total of 429 different grammar rules, which we deemed relevant to distinguish between, are distinctly recognized. This approach allows every program that can be compiled with GnuCOBOL to be analyzed by simply compiling it, using the pre-existing build scripts of the examined project.

5.2 Analysis Results

The most commonly used language constructs in Applewood Computing Accounting System [59] are listed in Table 5.1. In summary, basic constructs such as IF, ADD, MOVE etc. are used in this COBOL source as well. The most used constructs were record declarations and PICTURE clauses, similar to struct and variable declarations. VALUE IS and USAGE IS are related to these variable definitions. We also saw that the DISPLAY and ACCEPT commands, responsible for console output and input, were important.

Table 5.1: Summary of the 50 most commonly used language constructs in the COBOL program Applewood Computing Accounting System.
Language construct: Occurrences

RECORD declaration: 41660
PICTURE declaration: 30980
VALUE IS (on PICTURE): 17935
MOVE: 11919
CONDITION (88 level): 7934
IF: 5287
DISPLAY: 4022
CONSTANT ENTRY (78 level): 3301
(DISPLAY) WITH FOREGROUND-COLOR: 3288
GOTO: 3187
PARAGRAPH declaration: 2700
USAGE IS BINARY-LONG SIGNED: 2671
PERFORM _: 1699
Screen declaration: 1581
Screen opt COLUMN: 1528
REDEFINES (RECORD): 1417
USAGE IS BINARY-CHAR SIGNED: 1233
USAGE IS BINARY-SHORT SIGNED: 1215
ADD _ TO _: 1153
USAGE IS COMP: 1062
WRITE: 1050
ELSE: 1036
Screen opt LINE: 849
(WRITE) BEFORE/AFTER ADVANCING lines: 844
(RELEASE/WRITE/REWRITE) FROM: 814
Section header declaration: 801
USAGE IS COMP-3: 793
ACCEPT: 776
OCCURS: 719
EXIT SECTION: 645
(EVALUATE) WHEN: 570
(STRING) DELIMITED BY: 553
(accept opt) FOREGROUND-COLOR: 538
CALL: 510
(CALL/ENTRY) USING: 509
(accept opt) UPDATE/DEFAULT: 435
USAGE IS BINARY-CHAR UNSIGNED: 419
CLOSE: 416
(STRING/UNSTRING) WITH POINTER: 363
Screen opt USING: 351
STRING: 343
READ: 335
(DISPLAY) WITH HIGHLIGHT: 323
(DISPLAY) WITH ERASE EOL: 311
FILE-CONTROL ASSIGN: 296
FILE-CONTROL SELECT: 296
FD (file type): 285
(DISPLAY) WITH ERASE EOS: 265
SUBTRACT _ FROM _: 262
Screen opt FOREGROUND-COLOR: 236

5.3 Limitations and Validity of Generalization

A limitation of the developed tool is that it only supports analysis of COBOL programs compilable by GnuCOBOL (including pre-processing). As we only tested the tool on one program, we cannot be certain that the results obtained are generalizable. However, we have no indication that they should not generalize, except that the system under test is a console program, hinting that DISPLAY and ACCEPT might be used differently in more integrated programs.

6 Choosing a Target General-Purpose Language

This chapter explains why we chose C++ as an optimal target language. First, a list of popular languages (relating back to RQ1, which requires that the language be well-known) is obtained. Second, a comparison between these languages is given. Last, a discussion follows, concluding that only a subset of C++ (a hybrid between C and C++) will be supported by our work, due to time limitations.

6.1 The Comparison

The software quality company Tiobe compiles a list of languages each month, ranking them by popularity [3]. At the time of writing, the most recent one is the Tiobe Index for April 2017, which places the following languages, in order, as the most popular ones: Java, C, C++, C#, Python, PHP, VB.Net, and JavaScript. These languages will be used in the comparison. The headers below match the list of criteria introduced in the Method section.

6.1.1 Static or Dynamic Typing

The selected languages can be classified according to the two paradigms of loosely typed scripting languages (Python, PHP, VB.Net, and JavaScript) and typically compiled, statically type-checked languages (Java, C#, C, and C++, along with COBOL). Due to this difference, we decided to remove the scripting languages from the remainder of the comparison and focus on Java, C#, C, and C++. For sources supporting the claims made for each respective language, refer to the language specifications for Java [63], C [64], C++ [65], C# [66], and COBOL [67].

6.1.2 Memory Management and Environment

One important difference between Java and C# on one side, and C, C++, and COBOL on the other, is how memory is managed.
In COBOL, from what we saw in the language construct survey, most memory is allocated statically at the beginning of a program and remains available until program execution halts. On function calls, memory may also reside on the stack for parameters and return values. Even though we did not see it in actual code, it is possible to allocate memory on the heap [68]. Such heap memory must be returned to the system manually. Both C and C++ are similar in this regard. Java, on the other hand, only allows primitive data types to be allocated statically and stored on the stack. More complex objects need to be heap-allocated. Additionally, memory is managed through a garbage collector. C# makes the distinction in the language between heap-allocated, garbage-collected classes and stack-allocated structs.

6.1.3 Primitive Types

The primitive types in C#, C, and C++ allow for storing numbers in signed and unsigned form, compared to Java, which only allows signed numbers (except for char, which is a 2-byte unsigned type). COBOL has a complex type system with pictures (PICs) that store numbers or strings. The numbers can be signed or unsigned, and may contain a decimal, which might be implied (not present in memory). The memory representation can be binary (the same as in C/C++/C#/Java) or text-based, so that the number can be printed directly as a string. Additionally, it can be stored as a packed form of the text-based format. All languages support 32- and 64-bit floating-point numbers (typically IEEE floats).

Unlike the other languages, COBOL strings encode information about formatting, which can be arbitrarily specified by the programmer. A string, for example, might contain three letters followed by three digits, separated by an implied decimal (i.e. a comma not stored in memory at runtime, but still included when formatting the string). This information is stored in the type itself. When outputting these strings to the console using the DISPLAY command, they are formatted with, for example, the implied decimal printed and leading zeroes converted to spaces [69]. Also, as already hinted at in the COBOL language construct frequency survey, the DISPLAY command can manipulate the console appearance, something not natively supported in the other languages. For example, a majority of all DISPLAY commands in the analyzed code had the option to display WITH FOREGROUND-COLOR. These features should be possible to emulate in all the other languages using helper functions.

6.1.4 Classes and Objects

COBOL has the notion of records, where primitive data types and other sub-records are stored together, like a struct in C, C++, or C#. A class in C++, C#, and Java is also more or less similar. Of particular note are REDEFINES records, where a sub-record aliases another in memory; this can be implemented in C# with field offsets, and is similar to a union in C or C++. In Java, such functionality is not directly available to the developer. While C and C++ do not guarantee memory storage location, i.e. where in a struct a member is stored, or that unions can be used to read aliased memory, doing so is so common that it is almost an informal standard, according to our experience.
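As a hypothetical C++ sketch of the REDEFINES-style aliasing just described (our own illustration, relying on the informal layout conventions discussed above rather than on guarantees of the language standard), a union can expose two views of the same bytes:

    #include <cstdio>

    struct Date     { char year[4]; char month[2]; char day[2]; };  // structured view
    struct DateText { char text[8]; };                              // flat, REDEFINES-like view

    union DateRecord {
        Date     parts;  // write through the structured view...
        DateText raw;    // ...and read the same 8 bytes as flat text
    };

    int main() {
        DateRecord d = {{{'2','0','1','7'}, {'0','6'}, {'1','5'}}};
        std::printf("%.8s\n", d.raw.text);  // prints "20170615"
        return 0;
    }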
COBOL has also supported classes since a recent version, but we have not investigated the matter further, as the language construct survey revealed that they were not used in practice (it can also be argued that they are of less practical interest, as the vast majority of existing COBOL code was written before they were introduced into the language).

6.1.5 Functions

COBOL, C, and C++ allow something similar to a function to be called. Arguments are passed on the stack and returned there (or via processor registers, at the implementation's discretion). Similarly, Java and C# allow methods to be called on objects, even though free-standing functions are not allowed. Of note again is that whole objects cannot be passed on the stack in Java; a reference to a heap-allocated object must be used. C# and C++ allow operator overloading, where for example arithmetic operators such as '+' and '*' are mapped to user-defined functions. A COBOL function (a paragraph, in COBOL terminology) may not call itself recursively, in contrast to its C, C++, C#, and Java counterparts.

6.1.6 Basic Syntax

The syntax of basic operations and control flow is very similar in C, C#, C++, and Java. The concepts are similar in COBOL too, but COBOL tends to use separate statements for basic arithmetic (ADD, ADD-TO, MULTIPLY etc.) and a separate COMPUTE statement for more complex arithmetic expressions. In C, C++, C#, and Java, everything tends to be an expression, regardless of how complex it is. One important difference in COBOL compared to the other languages is that it can handle integer arithmetic overflow by natively controlling program flow using simple branching (like an if statement). Another important difference is that data can flow from several sources to several targets, compared to C/C++/C#/Java, which are more single-data load-store oriented. An example of this is the ADD statement, which allows several sources to be added together and then stored at several destinations, all in one statement.
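As a hypothetical sketch of how COBOL's native overflow branching could be mirrored in C++ (our own example; the function name is invented), the addition reports overflow, and ordinary control flow takes the error branch:

    #include <cstdint>
    #include <cstdio>

    // Returns true when the addition would overflow, roughly playing the
    // role of COBOL's overflow branch. The check avoids performing the
    // signed overflow itself, which is undefined behavior in C++.
    static bool add_with_size_error(std::int32_t a, std::int32_t b, std::int32_t& out) {
        if ((b > 0 && a > INT32_MAX - b) || (b < 0 && a < INT32_MIN - b))
            return true;
        out = a + b;
        return false;
    }

    int main() {
        std::int32_t total;
        if (add_with_size_error(2147483647, 1, total))
            std::puts("size error");     // the overflow branch
        else
            std::printf("%d\n", total);  // the normal branch
        return 0;
    }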
6.1.7 Pre-Processing and Meta-Programming

C, C++, and COBOL have pre-processors that allow simple macro expansion. C++ provides advanced meta-programming features through templates, allowing code run at compile time to decide how a type or function should be constructed. Java and C# have simpler template-like semantics implemented as generics, mimicking C++ templates in the sense that a type may take another type as an argument to construct the final instantiation at compile time. C++ templates, and C++'s ability to reference memory byte-by-byte and bit-by-bit, allow all COBOL data types to be emulated in a bit-perfect manner. Considering that there are theoretically an infinite number of data types in COBOL (as they are all custom), this is an important result. Furthermore, operator overloading in C++ can provide native arithmetic and casting support for these types. Paired with structs and unions, COBOL records can be emulated directly, as the memory can be forced to align perfectly. As previously mentioned, these custom types can be stored statically, stored on the stack as function parameters, and passed as function return values in C++. Finally, the COBOL equivalent of an array (a table), and its access semantics, can be similarly translated using a C++ template implementation.

6.2 Discussion

Given that COBOL data types can be constructed in C++, it seems to be the best choice given our constraints. It is also evident how Java's memory model might interfere with programming, as it is constrained to garbage-collected references in many cases. This might require developers to write code in a very specific manner, unlike what they are used to, making the popularity of the language less relevant. Writing code to fit Java-to-COBOL translation might also interfere with standard Java compile-time checks about what is allowed or not on regular static types. These reasons might defeat the purpose of having a transformation altogether.

Comparing C and C++, the latter is more or less a version of the former with new features. In order to fit the scope of the thesis, we chose to support a special variant of C++ that allows us to use the C++ features for implementing our standard library supporting COBOL data type emulation. User code, on the other hand, is restricted to the transformations that are supported (see chapter 8), more like basic C syntax. This restriction simplifies translation, as templates and overloaded operators need not be supported in the transformations. As already mentioned, classes are not supported either, due to the scoping of our work.

7 Emulating COBOL Data Types in C++

This chapter describes the standard library we developed for emulating COBOL data types in C++. The results will be used in chapter 8 when answering research question RQ3, for when a COBOL data type that has no native C++ counterpart is translated.

The standard library was realized with C++ templates and structs with operator overloads in these. Three different data types were realized: one for the default decimal numeric type (based on EBCDIC-encoded strings), one for the computational field type packed decimal, and one for regular strings. We concluded that other types of so-called computational fields could be translated to regular integral types and floating-point numbers. COBOL records can be represented with C++ structs, and REDEFINES records with unions, as the developed data types are stored exactly as their equivalent COBOL counterparts, bit by bit. Furthermore, COBOL tables may be represented as C++ arrays, and loop indexing by the native memory indexing type std::size_t. The COBOL type FILLER (which occupies dummy memory) is trivially represented by an empty array of bytes. Using these conversions allows every data type representable in COBOL to also be representable in C++.

7.1 Enhanced Byte Arrays

A native array of raw bytes (chars) in C++ cannot be directly passed by value on the stack [65]. This limitation was circumvented by defining a templatized struct, ByteArray, containing an array of the given size, as in Listing 1. The ByteArray could then be used to have full control over the memory when defining the emulation types. It should be noted that the compiler is allowed to lay out the memory as it wants [65], but we rely on it ordering fields in the order they are defined in a struct (in our experience, this is the common case). Furthermore, we rely on the compiler not aligning the memory addresses (e.g. on 32-bit or 64-bit boundaries), a feature that might need to be turned off in some implementations.

    typedef unsigned char byte;

    template <std::size_t L>
    struct ByteArray {
    private:
        byte data[L];

    public:
        ByteArray() : data() {}

        byte& operator[](std::size_t idx) { return data[idx]; }
        const byte operator[](std::size_t idx) const { return data[idx]; }
    };

Listing 1: An array of raw bytes that can be passed by-value on the stack.
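For illustration, a short usage sketch of Listing 1 (our own, not part of the thesis's listings): because the raw array is wrapped in a struct, a whole ByteArray can be returned by value, which a raw C array cannot.

    // Assumes the ByteArray definition from Listing 1 is in scope.
    ByteArray<4> make_id() {
        ByteArray<4> id;
        id[0] = 0xF1;  // EBCDIC '1'
        return id;     // returned by value, copying all four bytes
    }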
7.2 The Default Decimal Type

The default COBOL data type allows numbers to be stored in the EBCDIC format (similar to an ASCII string, but with another encoding) [67]. The four least significant bits of each byte contain the digit (0-9), and the four most significant bits are normally set high (1111). The data might be signed, in which case the sign mask replaces the four most significant bits of either the first or the last byte; which alternative is used can be configured in code when declaring the data type. Finally, the number may contain decimals, and the decimal sign (a dot) may be implied, meaning that it is not explicitly stored as a byte in the data. An implied decimal is the default. The byte widths of the integer and decimal parts are configured in code when declaring the data type.

The emulation type was realized as a templatized struct, shown in Listing 2, which allows all mentioned parameters to be specified as part of the compile-time type. Of particular note is how the expression defining the byte width can be resolved at compile time and baked into the instantiated type. Overloaded operators on the type, allowing addition and add-assign as examples of arithmetic operators, as well as casting to strings and doubles, are also of interest. To enhance readability, the constructors that initialize specific instances to zero, to a given default value, or to a copy of the packed decimal type discussed below (whose internals are accessed by making it a friend), are not shown.

7.3 Packed Decimal

COBOL also allows packing a default decimal number into almost half the bit-size, by removing the four most significant bits of each byte, thus storing two digits in the same byte [67]. Given that the resulting type cannot be directly printed, the decimal character is always implied, and thus not stored. In order to store the sign, an extra four bits are required, at the least significant position. For an odd number of digits, the sign bits fit into the remaining half-byte. For an even number of digits, the four most significant bits of the first byte are set to zero, and the four least significant bits of the last byte contain the sign. It also happens to be the case that four sign bits are reserved even for unsigned numbers, in which case they are set to all ones.

The packed decimal emulation type (see Listing 3) was realized similarly to the unpacked version. Its internal implementation temporarily converts to the unpacked version for calculations, and then back (optimizing for code simplicity over performance). While the cast overloading technically can handle all cases, C++ does not automatically use it, e.g. when adding a packed decimal to another packed decimal. Therefore, the operators are overloaded in this type too. When adding a packed decimal to the default unpacked version, casting is performed and the arithmetic operation is handled directly in the default type.
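To give a compressed impression of the idea behind the decimal types above, the following is our own minimal sketch of a zoned-decimal-style type (unsigned, with an implied decimal point), not the thesis's actual Listing 2 or 3:

    #include <cstddef>

    // I integer digits and D decimal digits, one EBCDIC-zoned byte per digit:
    // high nibble fixed to 1111, low nibble holding the digit value.
    template <std::size_t I, std::size_t D>
    struct ZonedDecimal {
        unsigned char digits[I + D];

        ZonedDecimal() {
            for (std::size_t i = 0; i < I + D; ++i)
                digits[i] = 0xF0;  // zone bits set high, digit 0
        }

        // Cast to double, honoring the implied decimal point, which is part
        // of the compile-time type and never stored in memory.
        operator double() const {
            double v = 0.0;
            for (std::size_t i = 0; i < I + D; ++i)
                v = v * 10.0 + (digits[i] & 0x0F);
            for (std::size_t i = 0; i < D; ++i)
                v /= 10.0;
            return v;
        }
    };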
7.4 Strings

COBOL also allows its data types to take an arbitrary form of text and numbers, possibly formatted in a certain way [67] (see chapter 6 for details). Thus, the formatting is part of the type information. As a data type that contains text cannot be used directly in numerical calculations (without converting it first, e.g. by using REDEFINES records and overlaying a decimal type padded by FILLER), such data types are represented by a separate string type. The multi-byte string type Double-Byte Character Set (DBCS) is not explicitly supported by our implementation. The implementation should be able to handle such data transparently, but there is no way of specifying literals of such a type in C++, and we have not developed any value-converter as a workaround.

The string type was realized using a template that poses as a tuple, allowing different pieces of the string, with different formatting, to be concatenated together. For example, the code PIC<X<5>, C, Z<3>, V, N<2>> WS_HELLO("Hello,00314"); defines:

• five characters of any type (X),
• followed by a comma (C),
• followed by three digits, where leading zeroes should be replaced with spaces when formatting the string (Z),
• followed by an implied decimal that should not be stored in memory, but still printed when formatting the string for output (V),
• and finally followed by two regular digits (no special formatting of leading digits) (N).

Listings 4 and 5 show how the types were implemented in more detail. The tuple is recursively defined and accessed using tail recursion, with inspiration from the method presented in [70], but using types instead of literals. Note how each type (X, C, Z, V, N etc.) has a cast-to-string operator overload and occupies as much memory as needed. Note also how the V option is only part of the compile-time type, as it has no members and thus occupies no memory at run time.

7.5 Summary

We have presented one way to represent COBOL data types in C++. Relating back to the feature goals in the Method section, trivial tests showed that all listed features work. Data is stored bit-perfectly in byte arrays. Overloads on the encapsulating integer types allow simple arithmetic operations (with overflow detection) and casting between formats. String types can be formatted properly. COBOL RECORDS and REDEFINES can be handled by native language constructs (C++ struct and union), given some assumptions on the compiler, notably outside the language specification but commonly implemented by vendors. The results obtained can thus be used when answering research question RQ3.