Mining Relations from Git Repositories
Applying Relation Extraction Technology to Git Commit Messages

Master of Science Thesis in Software Engineering

Rikard Andersson

Chalmers University of Technology
University of Gothenburg
Department of Computer Science and Engineering
Göteborg, Sweden, August 2014


The Author grants to Chalmers University of Technology and University of Gothenburg
the non-exclusive right to publish the Work electronically and in a non-commercial
purpose make it accessible on the Internet. The Author warrants that he/she is the
author to the Work, and warrants that the Work does not contain text, pictures or other
material that violates copyright law.

The Author shall, when transferring the rights of the Work to a third party (for
example a publisher or a company), acknowledge the third party about this agreement.
If the Author has signed a copyright agreement with a third party regarding the Work,
the Author warrants hereby that he/she has obtained any necessary permission from
this third party to let Chalmers University of Technology and University of Gothenburg
store the Work electronically and make it accessible on the Internet.

Mining relations from git repositories
Applying relation extraction technology to git commit messages

Rikard Andersson

© Rikard Andersson, August 2014.

Supervisor: Morgan Ericsson

Examiner: Matthias Tichy

Chalmers University of Technology
University of Gothenburg
Department of Computer Science and Engineering
SE-412 96 Göteborg
Sweden
Telephone + 46 (0) 31 - 772 1000

Department of Computer Science and Engineering
Göteborg, Sweden August 2014


Abstract

Text data can contain valuable information that is unavailable at a larger scale due
to the unstructured nature of free text. Git repositories, and the Git commit messages
within them, are one such category of unstructured text data. Relation Extraction (RE) has
enjoyed success as a solution to similar problems, both in the generic case and in more
specialized domains such as the life sciences. RE does, however, remain largely untested on
text data from Git repositories. This thesis contributes to RE and Software Engineering
research by testing RE solutions developed for the generic problem on the domain-specific
problem of Git commit messages. An experiment is conducted where a custom-made
relation extractor is tested on hand-annotated Git commit messages drawn from
popular public projects on GitHub. The results show that common RE solutions and
their models cannot be directly applied to data from Git commit messages because of the
very domain-specific language in which these messages are expressed. This calls for future
efforts to develop domain-specific tools and models.

Keywords: Relation Extraction, Natural Language Processing, Git commit messages,
GitHub


Acknowledgements

I would like to thank the people at Findwise for letting me use their splendid facilities,
for giving me insight into their work, and for welcoming me as one of the group. Special
thanks to Svetoslav Marinov, who has been very generous with both time and expertise
whenever I have needed it. Many of the ideas in the relation extractor described in this
thesis are his contributions, as is relation extraction as a subject for the thesis.

I would also like to thank my supervisor at Chalmers, Morgan Ericsson, who has
provided knowledge and guidance through tough passages. Moreover, the whole idea
of experimenting on Git commit messages was originally his.

Rikard Andersson, Göteborg, 8 August 2014


Contents

1 Introduction

2 Related work

3 Background
    3.1 Relation extraction
        3.1.1 Pre-processing
        3.1.2 Candidate identification
        3.1.3 Binary task
        3.1.4 Triple task
    3.2 The relation extractor Relex
        3.2.1 Pre-processing
        3.2.2 Relation candidate identification
        3.2.3 Binary task
        3.2.4 Triple task

4 Methodology
    4.1 Data
        4.1.1 Training data
        4.1.2 Test data
    4.2 Experiments
        4.2.1 Binary task
        4.2.2 Triple task
    4.3 Delimitations

5 Results
    5.1 Binary task
        5.1.1 Entity relations
        5.1.2 Author relations
    5.2 Triple task
    5.3 Qualitative results of pre-processing
        5.3.1 Sentence detector
        5.3.2 Part-of-speech tagger

6 Discussion
    6.1 Analysis of results
        6.1.1 Sentences
        6.1.2 Part-of-speech
        6.1.3 Recurring patterns
        6.1.4 Language use
    6.2 Alternatives
    6.3 Overall results

7 Threats to validity
    7.1 Conclusion validity
    7.2 Internal validity
    7.3 External validity
    7.4 Construct validity

8 Conclusion

Bibliography

A List of entity types

B Annotation guidelines

C ConLL 2011 shared task data example

D Relex binary task performance on ConLL data

E Relex triple task performance on ConLL data

F List of projects in GHTorrent


1
Introduction

Large quantities of the information around us are expressed in the form of natural language
in free text. Popular examples include sources from the internet such as social media, blogs,
news articles, and published papers. These data sources contain potentially valuable
information. The information is, however, difficult to harvest on a large scale due to the
unstructured nature of the data format. This calls for technological solutions. Such
solutions would include methods, tools, and systems to automatically structure and
aggregate the data. Common techniques for these purposes reside in research areas such
as information retrieval, data mining, and Natural Language Processing (NLP). The
last of these, NLP, is specifically aimed at processing text data, structuring it, and
building models from it.

Within a completely different branch of science, Software Engineering (SE), researchers
are trying to understand software development. One way of doing this is
by analysing data fetched from large software repositories. Such repositories primarily
contain artifacts directly related to the end product of software development. However,
they also contain metadata about the development process, some of which is produced
by the developers themselves, mainly for collaborative purposes. This data is created
as chunks of free text that describe the development process and is meant to help
retain a rich history of changes. It is found in components such as commit messages,
issues, and bug reports. Such data could certainly be of analytical interest to
developers, decision makers, and researchers alike but is largely inaccessible due to the
unstructured form in which it is represented.

There is thus valuable data about software development within the grasp of potential
beneficiaries but unavailable to them due to a lack of technology. NLP technologies
offer real solutions to these problems but remain largely untested for the domain.
Different fields within the life sciences have similar sources of unstructured data produced
by practitioners, e.g. patient journals, from which large quantities of immensely valuable
data could be extracted. For this purpose there are tools¹, research groups, commercial
ventures², and even a dedicated conference³ for NLP applications within the life sciences.
As such, the application of NLP technology to life science data can be considered
successful. This is a promising indicator that the same kind of success could extend to
applications within SE.

For the above reason, this thesis evaluates the use of relation extraction
systems for free text on text data generated in Git repositories during the software
development process. More specifically, the focus is on Git commit messages. The
idea is to take successful concepts from previous relation extraction research on the
generic case of well-structured English sentences and apply them to Git commit
messages. Reporting and discussing the results will contribute to evaluating relation
extraction, in its current state, as a technology ready to support this kind of data and
task. We formulate the following research question:

How successfully can supervised relation extraction systems trained on English
free text be applied to Git commit messages?

To answer this question, an experiment will be conducted where a system for relation
extraction will be applied to commit messages from some of the largest and most active
repositories available on GitHub. Results from the experiments will act as data for a
quantitative and a qualitative discussion.

The remainder of this thesis introduces related work in Section 2, followed by a
background for the study in Section 3, which introduces NLP, relation extraction, and
tools in these domains. The background also introduces a specific system, Relex, which
is used in the experiments conducted in the study. Section 4 presents details about
the methodology used for the research, including data, experiments, and delimitations.
Section 5 lists the results from the experiments, while Section 6 contains discussion and
analysis of the results in order to answer the research question. Section 7 contains an
analysis of the threats to validity, and finally Section 8 provides final conclusions and
future work.

¹ http://zope.bioinfo.cnio.es/bionlp_tools/get_all_bionlp_tools_out?SUBMIT=Submit+Query
² http://www.ezdi.us/ezcac/technology/
³ http://2013.bionlp-st.org/



2
Related work

Data use in Software Engineering (SE) is studied by Halkidi et al. [17], who present a
rundown of different data mining techniques and how they can be applied to extract
information from software repositories in the interest of SE research. The authors
introduce text mining and Natural Language Processing (NLP) as a special case of data
mining and mention it as a possible way of extracting data from SE activities. In parallel,
Hassan and Xie [18] coin the term software intelligence as a counterpart to business
intelligence. They suggest that up-to-date information derived from software repositories
can be used not only by software developers but also by decision makers, software
researchers, and educators within SE. The authors list text data as one of three important
data sources on which mining algorithms can be used to extract valuable information.

From the workshop Mining Software Repositories¹ a large number of tools have
emerged [7]. A number of these tools are developed and used to mine text data from bug
reports. These include the work of Runeson et al. [31] and Wang et al. [37], who use text
mining and information retrieval techniques to identify duplicate bug reports. Similarly,
Wu et al. [39] introduce a tool reliant on the same algorithms to help with bug report
redundancy and autocompletion. Efforts made using other data sources include Chen et al. [8],
who use topic modelling to categorise software entities (e.g. methods, files, classes, and
modules) based on comments and identifier names. Doing this, they find topics that are more
defect prone than others. Natt och Dag et al. [25] develop a tool for requirements management
using NLP techniques.

¹ http://2015.msrconf.org/

Not only are researchers using data gathered from GitHub in their research [35, 36],
but there is also work being done specifically on Git commit data. In a recent research
effort, Pletea et al. [28] analysed commit messages using available tools for sentiment
analysis. Ali et al. [2] develop a tool that aims to keep traceability links between software
artifacts up to date. One of the components of this tool mines commit messages from
CVS and SVN repositories in order to create links between changes and the requirements
affected by them. This is done with simple NLP using a vector space model and similarity
measurements. In all of the tools mentioned here, NLP technologies, albeit simpler than
relation extraction, are used to mine information from software repositories. This
information can then be used to aid the development process or research on it.



3
Background

In structuring unstructured data in the form of natural language there are two highly
relevant areas of research: semantic parsing and information extraction. Semantic parsing
is based on the structure of language and goes from free text to syntactic structure
to semantic meaning. The idea is that semantic meaning can be found in syntactic
components of the text. A subcategory of semantic parsing is shallow semantic
parsing, also called Semantic Role Labeling (SRL). In SRL the task is to find the predicates
of sentences and all their constituents, such as object and subject. A corpus annotated
in this way is PropBank, where intricate structures of arguments and predicate modifiers
interact to create meaning [27]. By using the predicate as the relationship, and its
constituents as entities, SRL can work as a relationship extraction technique [9].

Information extraction, in contrast, is a more direct and pragmatic approach where
less computationally expensive methods can be applied. Information extraction systems
can still use the signals of syntactic structure, e.g. part-of-speech tags [5] and dependency
parses [41]; however, this information is only used as features and hence does not constitute
the entire analysis. As opposed to semantic parsing, where every part of the text is
included in the model, information extraction may focus on the parts of the text that convey
meaning. Since the data under consideration in this thesis is potentially of low syntactic
quality with a lot of noise, we propose information extraction as a solution domain better
suited to our problem. This chapter introduces relation extraction in general
and, more specifically, Relex, a relation extractor used in the experiments of this work.

3.1 Relation extraction

Systems doing information extraction take raw text as input and output a set of
relationships between entities found in that text [5]. This task can further be divided into
subtasks, normally consisting of at least named entity extraction and relation
extraction [11]. From the point of view of relation extraction, any other subtasks of
information extraction can be seen as pre- and post-processing. Named entity recognition,
often a pre-processing step to relation extraction, is a separate research area in itself and
is often excluded from discussion in papers about information extraction systems. These
papers focus instead on the relation extraction algorithm and how it is combined with named
entity recognition and other pre-processing steps. The relationships extracted by these
systems are often referred to simply as relations, or as relation tuples with reference to
the two entities between which the relation exists.

Relation extraction systems are normally limited to finding relations explicitly
expressed in the text [34]. Latent and implicit relations are much harder to infer and
normally require a different system entirely [12]. Another common delimitation is to
only look for binary relations [5, 40]. Systems that do look at n-ary or nested relations
are those more closely related to semantic parsing [9].

As an example of what relation extraction produces, consider the following sentence:

“On October 9, 2006, it was announced Google will purchase YouTube for
US$1.65 billion in stock”

In this sentence a binary relation, will purchase, should be found between the two
named entities Google and YouTube. A nested relation would result from adding the amount
(US$1.65 billion) or the date (October 9, 2006) as a modifier to the entire relation.
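
To make the distinction concrete, the example can be written down as data. The following is only an illustrative sketch: the class names `Relation` and `NestedRelation` and the modifier keys are hypothetical and are not taken from any of the systems discussed in this chapter.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class Relation:
    """A binary relation: two entities joined by a relation phrase."""
    arg1: str
    phrase: str
    arg2: str


@dataclass
class NestedRelation:
    """A binary relation plus modifiers that apply to the relation as a whole."""
    relation: Relation
    modifiers: Dict[str, str] = field(default_factory=dict)


# The binary relation expressed in the example sentence.
binary = Relation("Google", "will purchase", "YouTube")

# The same relation with the amount and the date attached as modifiers.
nested = NestedRelation(
    binary,
    {"amount": "US$1.65 billion", "date": "October 9, 2006"},
)
```

A system restricted to binary relations would only produce the `binary` value; the modifiers are what a system handling nested relations would add.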

Relation extraction from large scale corpora was originally restricted to one or several
predefined relation types per effort. DIPRE [6] is a prominent system in this traditional
relation extraction paradigm, with expansions such as Snowball [1]. Both of these are
limited to extracting instances of a single type of relation. The open relation extraction
paradigm was invented as a reaction to this limitation. Efforts in this domain are not
restricted to a fixed number of relation types. The first system in this domain was
TextRunner [5]¹; later systems include ReVerb [13], WOE [38], OLLIE [32], SONEX [22],
and an unnamed system that will be referred to as XU [40] in the remainder of this
thesis.

Another dimension along which to categorize relation extraction efforts is the
level of supervision the system requires. Most commonly, systems are either supervised,
such as XU, or semi-supervised, such as TextRunner. Additionally, some systems
explore the use of coupled sources of structured and unstructured data for training.
This is called distant supervision [23], a variant of which is used in WOE, where Wikipedia
articles and information boxes are compared for patterns of relations. There is also a
movement of unsupervised systems, which use clustering techniques to group relation
tuples by the relation they express [30]; SONEX is one such system.

¹ It should be noted, however, that even though Banko et al. [5] are the first to suggest open
information extraction as a branch of information extraction, preceding systems such as
Jinxiu et al. [19] operate within the same domain without labeling it as such.

Each individual relation extraction system varies to some degree from the other systems in
its domain, but they also share many similarities. The system is normally
a pipeline where raw text is the input and relation tuples are generated at the end. At
each pipe in the pipeline an additional piece of information is attached and later used
by other pipes in order to achieve the end result in aggregate. The relation extraction
process normally works with signals from other Natural Language Processing (NLP) tasks,
which hence need to be performed before relation extraction starts. All of these steps
can be seen as pre-processing. The relation extraction process itself normally starts with
a relation candidate identification step where possible relation tuples are determined.
Following this, the relation tuples are assessed for quality, and finally the relation tuples
found to be trustworthy are labeled with a relation phrase. This division follows the work of
Xu et al. [40], who label these two concluding steps the binary task and the triple task.
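
The pipeline idea, where each pipe attaches information that later pipes reuse, can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions: the dictionary-based document, the two toy pipes, and all names are hypothetical and do not describe how any of the cited systems is implemented.

```python
from typing import Any, Callable, Dict, List

Document = Dict[str, Any]
Pipe = Callable[[Document], Document]


def tokenize(doc: Document) -> Document:
    """Attach a naive whitespace tokenization of the raw text."""
    doc["tokens"] = doc["text"].split()
    return doc


def mark_capitalized(doc: Document) -> Document:
    """Attach the capitalized tokens, reusing what an earlier pipe produced."""
    doc["capitalized"] = [t for t in doc["tokens"] if t[:1].isupper()]
    return doc


def run_pipeline(text: str, pipes: List[Pipe]) -> Document:
    """Feed raw text through the pipes in order, accumulating annotations."""
    doc: Document = {"text": text}
    for pipe in pipes:
        doc = pipe(doc)
    return doc


result = run_pipeline("Google will purchase YouTube", [tokenize, mark_capitalized])
```

The order of the pipes matters: `mark_capitalized` fails without the `tokens` entry, in the same way that relation extraction depends on its pre-processing steps having run first.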

3.1.1 Pre-processing

Different systems work with different features for the relation extraction process. From
a relation extraction perspective, adding information for these features to the data is
considered pre-processing. Common pre-processing steps are sentence detection,
tokenization, and part-of-speech tagging. A sentence detector finds sentences by looking for
their ends. Similarly, a tokenizer finds individual words, called tokens, and tags them as
such. During part-of-speech tagging each token is assigned a part-of-speech label,
e.g. verb, adjective, or noun. Sentence detection and tokenization are done as part of the
pre-processing for all relation extraction systems in this thesis. Part-of-speech tags are
not used by DIPRE and Snowball but are used by all the others. The authors of both
TextRunner and SONEX claim to use “light NLP” techniques, where TextRunner, in addition
to sentence detection, tokenization, and part-of-speech tagging, also uses a noun-phrase
chunker. The reason for this restriction is that any more advanced techniques would not
perform as well on the noisy data they anticipate.
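
The three common steps can be illustrated with a deliberately naive sketch. Real systems use trained statistical models for each step; here the sentence detector and tokenizer are simple regular expressions, and the tagger is a hand-written toy lexicon (`TOY_LEXICON`, an assumption made purely for illustration).

```python
import re
from typing import Dict, List, Tuple

# Toy lexicon standing in for a trained part-of-speech model (assumption).
TOY_LEXICON: Dict[str, str] = {
    "google": "NOUN",
    "will": "AUX",
    "purchase": "VERB",
    "youtube": "NOUN",
    ".": "PUNCT",
}


def detect_sentences(text: str) -> List[str]:
    """Find sentences by looking for their ends: ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize_sentence(sentence: str) -> List[str]:
    """Split a sentence into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def tag_tokens(tokens: List[str]) -> List[Tuple[str, str]]:
    """Assign each token a label from the toy lexicon; unknown tokens get 'X'."""
    return [(t, TOY_LEXICON.get(t.lower(), "X")) for t in tokens]


sentences = detect_sentences("Google will purchase YouTube. The deal is large.")
tagged = tag_tokens(tokenize_sentence(sentences[0]))
```

Rules this simple break easily on noisy text (an abbreviation or a file name containing a period already splits a sentence in the wrong place), which hints at why models trained on well-formed text may struggle on less regular input.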

Another common pre-processing step is dependency parsing, which is a type of
syntactic analysis applied to sentences. The analysis produces a hierarchical syntactic
graph with words as nodes and relations between words as edges. The edges can either
be labeled or not. This structure is called a dependency tree. The dependency tree has
a root, which is normally the predicate, with the nodes underneath being arguments and
subclauses. The dependency tree is very suitable for relation extraction, and features from
dependency trees are very common among systems in the domain; they are used by ReVerb,
WOE, OLLIE, and XU, among others.
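
As an illustration of the data structure, a dependency tree for the fragment "Google will purchase YouTube" can be stored as a list of labeled arcs. The arcs and edge labels below are written by hand for the sake of the example and are assumptions, not the output of any particular parser.

```python
from typing import List, Optional, Tuple

# (dependent index, head index or None for the root, edge label)
Arc = Tuple[int, Optional[int], str]

tokens: List[str] = ["Google", "will", "purchase", "YouTube"]
arcs: List[Arc] = [
    (0, 2, "nsubj"),    # Google   <- purchase
    (1, 2, "aux"),      # will     <- purchase
    (2, None, "root"),  # purchase is the root (the predicate)
    (3, 2, "dobj"),     # YouTube  <- purchase
]


def root_of(tree: List[Arc]) -> int:
    """Return the index of the token that has no head."""
    return next(dep for dep, head, _ in tree if head is None)


def children_of(head: int, tree: List[Arc]) -> List[Tuple[int, str]]:
    """Return (dependent index, label) for every node directly under `head`."""
    return [(dep, label) for dep, h, label in tree if h == head]


predicate = tokens[root_of(arcs)]
arguments = [(tokens[i], label) for i, label in children_of(root_of(arcs), arcs)]
```

With the predicate at the root and the two entities directly beneath it, the tree exposes the relation and its arguments almost directly, which is why such features are attractive when the parse can be trusted.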

Commonly, relations are defined as existing between two entities, with a relation
phrase describing what relation holds between them [5, 38, 40]. What an entity is
can, however, be interpreted differently. Most of the systems discussed in this chapter
interpret noun phrases as entities, while other options include using external sources
to define an entity [34] or extracting entities based on regular expressions, as done in
DIPRE and Snowball.

Additionally, there are those, like SONEX, that use named entities as the entities
and named entity recognizers to extract entities for their relation extractors. The term
Named Entity was coined to describe a task in the sixth Message Understanding Conference
(MUC-6) [16]. For this task a named entity was described as a unit of information
and detailed with a few examples including persons, geographic locations, and
organizations. While these are three common categories for testing named entity recognition,
others have suggested additional categories to suit the needs of the application for which
named entities are employed. A few examples include chemical names, job titles, and project
names [24].

Commonly, both pre-processing pipes and supervised relation extraction pipes use
machine learning components trained on gold standard data. Such data is hand-annotated
by multiple annotators with a systematic process of peer review [21].

3.1.2 Candidate identification

Relation candidate identification is where possible instances of relations are found in the
raw text. This step is often followed by another called relation candidate extraction which
is where each found instance is extracted along with its context for further processing.
Such an extracted instance is often called a relation candidate. In this thesis these steps
are bundled under the relation candidate identification step.

There are a few different strategies for relation candidate identification among the
systems mentioned previously. The most common one is entity-centric relation
identification. This is where entities occurring together within a boundary are paired up to
form an instance from which to draw a relation candidate. DIPRE uses this strategy, bounded
by a set maximum number of tokens between the entities. Snowball, TextRunner,
WOE, and SONEX all limit their bounds to entities occurring within the same sentence.
XU also bounds relation candidates to within sentences but additionally requires a maximum
number of tokens between the entities. In contrast to these entity-centric strategies,
ReVerb uses a relation-centric approach where the relation phrase is extracted first. The
context is then matched for entities in its surroundings using heuristics. Finally, OLLIE
identifies relation candidates by matching patterns for both entities and relation phrase
in combination.
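
The entity-centric strategy, with its two kinds of boundary, can be sketched as follows. The `Entity` class, the function name, and the parameter `max_tokens_between` are hypothetical and chosen for illustration; the sketch does not reproduce the implementation of any of the systems above.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Entity:
    text: str
    sentence: int  # index of the sentence containing the entity
    start: int     # index of the entity's first token within the sentence
    end: int       # index one past the entity's last token


def identify_candidates(
    entities: List[Entity],
    max_tokens_between: Optional[int] = None,
) -> List[Tuple[Entity, Entity]]:
    """Pair up entities in the same sentence, optionally bounded by token distance."""
    ordered = sorted(entities, key=lambda e: (e.sentence, e.start))
    candidates = []
    for left, right in combinations(ordered, 2):
        if left.sentence != right.sentence:
            continue  # sentence boundary
        gap = right.start - left.end
        if gap < 0:
            continue  # overlapping spans cannot form a pair
        if max_tokens_between is not None and gap > max_tokens_between:
            continue  # token distance boundary
        candidates.append((left, right))
    return candidates


google = Entity("Google", sentence=0, start=0, end=1)
youtube = Entity("YouTube", sentence=0, start=3, end=4)
skype = Entity("Skype", sentence=1, start=0, end=1)

within_sentence = identify_candidates([google, youtube, skype])
within_one_token = identify_candidates([google, youtube, skype], max_tokens_between=1)
```

With only the sentence boundary, Google and YouTube form a candidate while Skype, in another sentence, is never paired. Adding a limit of one token between entities removes the pair as well, since two tokens separate them.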

3.1.3 Binary task

The binary task as defined by Xu et al. [40] is to determine whether or not a relation
exists between two entities. For the authors and their system, XU, this is done by
classifying relation candidates as true or false. The classification is done with a Support
Vector Machine (SVM) [10] using tree kernels based on dependency parses. TextRunner [5]
employs the same type of classification but uses a Naive Bayes classification algorithm
instead of the SVM used in XU. Features include the part-of-speech sequences before, between,
and after the entities. TextRunner’s classifier is trained using relations from a type of
linguistic parse called a constituent parse.

Later versions of TextRunner [4], as well as WOE, instead employ a graphical model
called Conditional Random Fields (CRF) [20], where each token between two entities is
labeled as part of or outside of a relation phrase. In this design, if no relation phrase is
found there is no relation.
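
The consequence of this labeling design can be shown with a small decoding sketch. The labels are assumed to come from a trained sequence model; here they are written by hand, and the label names `REL` and `O` are our own choice rather than those used by TextRunner or WOE.

```python
from typing import List, Optional


def decode_relation_phrase(tokens: List[str], labels: List[str]) -> Optional[str]:
    """Join the tokens labeled as part of the relation phrase.

    Returns None when no token is labeled, which in this design means that
    no relation is assumed to hold between the two entities.
    """
    phrase = [token for token, label in zip(tokens, labels) if label == "REL"]
    return " ".join(phrase) if phrase else None


# Tokens between "Google" and "YouTube", with hand-written labels.
found = decode_relation_phrase(["will", "purchase"], ["REL", "REL"])

# Tokens between two entities that merely co-occur, e.g. "Google and YouTube".
missing = decode_relation_phrase(["and"], ["O"])
```

In this way a single labeling pass answers both whether a relation exists and what its relation phrase is.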

DIPRE and Snowball both use primitive pattern matching techniques where each
relation candidate is matched against generated patterns to determine if there is a
relation or not. The patterns are made up of features including the tokens before, between,
and after the entity pair, as three separate features.

OLLIE and ReVerb both assign relation candidates a confidence value using a logistic
regression classifier. Relation candidates with a confidence value below a threshold value
can later be discarded.

SONEX, using an unsupervised approach, is structured somewhat differently
compared to the other systems discussed here. For the binary task it simply assumes that
a relation exists between two named entities if they occur in the same sentence. The
system then uses clustering algorithms to group similar relation instances with each other.
Relation instances in the same cluster are then assumed to be of the same relation type.
Features used by the clustering algorithm include unigram tokens, bigram tokens, and
part-of-speech sequences.

3.1.4 Triple task

Xu et al. [40] define the triple task as classifying a relation phrase as true or false under
the assumption that the relation candidate truly contains a relation. This definition
is a good fit for how XU is structured. The triple task in XU is done by heuristically
eliciting relation phrase candidates based on patterns. Each relation phrase candidate is
then classified as true or false with an SVM. Here we propose a slightly broader definition
of the triple task, where both of these steps can be merged into one: to simply identify
a relation phrase that explains the semantic relation between two
entities. Using this definition we can explain the triple task solution in SONEX. Here
relation instances are clustered, and a relation phrase is extracted for each cluster to
label all the relation instances in the cluster. This is done by eliciting the centroid of the
tokens between entities in each cluster.

The first version of TextRunner [5] extracted all tokens between entities as a candidate
relation phrase and then heuristically eliminated words using part-of-speech tags.
The relation phrase was then used as a feature for classification in the binary task and
accepted as the relation phrase if the entire relation candidate was classified as a true
relation. Later evolutions of TextRunner [4], as well as WOE, take care of the triple task
and binary task in parallel, using a CRF model to label tokens between entities as part
of the relation phrase or not. Similarly, OLLIE employs pattern matching with a dependency
tree against the training data to elicit both entities and relation phrases. Elicited
relation candidates are then assessed for confidence as described for the binary task
(Section 3.1.3). Features used by these systems are largely the same and include tokens,
part-of-speech unigrams, part-of-speech sequences, and in some cases dependency trees.

The division of a system's functionality into a binary and a triple task is incomplete,
because some systems are simply structured differently. ReVerb, for example, takes
care of the triple task as part of the candidate identification. The system finds relation
phrases as relation candidates and then uses a set of heuristics to locate the closest noun
phrase to the left and right of the identified relation phrase. These noun phrases are
then used as the entities between which the found relation phrase holds. ReVerb is as such a
relation-centric system, whereas the other systems detailed here are entity-centric.
Similarly, the nature of traditional relation extraction systems like DIPRE and Snowball,
where only a single or a limited set of relation types are evaluated, makes the triple task
redundant.
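
The relation-centric idea can be sketched as a search for the nearest noun phrase on each side of an already identified relation phrase. This is a strong simplification made for illustration: the span representation and the function name are hypothetical, and the actual heuristics used by ReVerb are more elaborate than a nearest-neighbour lookup.

```python
from typing import List, Optional, Tuple

# A span is (start token index, end token index exclusive, text).
Span = Tuple[int, int, str]


def nearest_arguments(
    noun_phrases: List[Span], rel_start: int, rel_end: int
) -> Optional[Tuple[Span, Span]]:
    """Return the closest noun phrase to the left and to the right of the relation phrase."""
    left = [np for np in noun_phrases if np[1] <= rel_start]
    right = [np for np in noun_phrases if np[0] >= rel_end]
    if not left or not right:
        return None  # no entity on one side, so no relation tuple can be formed
    return max(left, key=lambda np: np[1]), min(right, key=lambda np: np[0])


# "On Monday Google will purchase YouTube"
#   0    1      2     3      4        5
noun_phrases = [(1, 2, "Monday"), (2, 3, "Google"), (5, 6, "YouTube")]

# The relation phrase "will purchase" covers tokens 3 and 4.
arguments = nearest_arguments(noun_phrases, rel_start=3, rel_end=5)
```

Here the relation phrase is fixed first and the entities follow from it, which is the reverse of the entity-centric candidate identification described in Section 3.1.2.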

3.2 The relation extractor Relex

So far we have established that semantic parsing and SRL break sentences down into a
structure of predicates and arguments, hoping to explain meaning with these components.
Since this is done using syntax, assuming correctly formulated English sentences, it is
reasonable to expect that these techniques will fail when syntax is bad. The potentially
low text quality of the data at the centre of this thesis leads to the conclusion that
information extraction is a better solution domain for the task at hand. Furthermore, as
is the case with TextRunner and SONEX, a system that solves the problem will likely rely
on “light NLP” techniques for the same reason of noisy data. Based on this reasoning,
and with influences from the systems described in the previous section (Section 3.1), we
have developed a simple yet effective (see Appendices D and E) system in collaboration
with an industry partner, Findwise.

The system, which we refer to as Relex [3], belongs to the open relation extraction
domain. The pipeline has pre-processing with light NLP techniques, a candidate
identification step, a component for the binary task (binary classifier), and one for the
triple task (relation phrase extractor).²

3.2.1 Pre-processing

In order to operate on the data, a few pre-processing steps are performed: sentence
detection, tokenization, and part-of-speech tagging. Sentence detection and tokenization
split the text into units of sentences and tokens respectively, while the part-of-speech
tagger assigns a part-of-speech label to each token. All of these are performed with
components from Apache OpenNLP³ with free models released by Apache⁴ and trained on
data derived from well-structured English texts such as newswire or published papers.
The relation extractor expects entities as input in order to extract relation candidates.
Hence, Relex is able to support entities extracted by either a noun phrase chunker
or a named entity recognizer⁵.

3.2.2 Relation candidate identification

Relation candidate identification is the first step of the actual relation extractor. The
candidate identifier combines pairs of entities into relation candidates to be evaluated
further. Additional rules can be added to this step in order to limit what is considered
a possible relation. An example is to only allow relations where the entities co-occur
within a boundary such as a sentence or a span of a maximum number of tokens. By default
Relex only applies the sentence boundary, but it also supports a maximum token distance
boundary.

2Relex can be found in the project repository at https://github.com/Rikard-Andersson/relex
3https://opennlp.apache.org/
4http://opennlp.sourceforge.net/models-1.5/
5Since all entities in the experiments will be derived from annotations, a noun phrase
chunker or named entity recognizer is left out of the implementation.
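
The candidate identification step described in this section can be illustrated with a
minimal sketch. It is not taken from the Relex code base: the Entity type, its fields, and
the function name are assumptions made for illustration, and the token distance is assumed
to be the number of tokens strictly between the two entities.

```python
from itertools import combinations
from typing import List, NamedTuple, Optional, Tuple


class Entity(NamedTuple):
    """Hypothetical entity annotation (not the Relex data model)."""
    text: str
    sentence: int  # index of the sentence containing the entity
    start: int     # index of the entity's first token
    end: int       # index one past the entity's last token


def relation_candidates(
    entities: List[Entity], max_token_distance: Optional[int] = None
) -> List[Tuple[Entity, Entity]]:
    """Pair entities that co-occur in a sentence, optionally within a token span."""
    ordered = sorted(entities, key=lambda e: e.start)
    candidates = []
    for first, second in combinations(ordered, 2):
        if first.sentence != second.sentence:
            continue  # default rule: both entities must share a sentence
        gap = second.start - first.end  # tokens strictly between the entities
        if max_token_distance is not None and gap > max_token_distance:
            continue  # optional rule: entities too far apart
        candidates.append((first, second))
    return candidates
```

With only the sentence rule active, every pair of entities in the same sentence becomes a
candidate, which is why the number of false candidates grows quickly in long sentences.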

3.2.3 Binary task

In the binary task each identified relation candidate is classified as true or false. Like
XU, Relex uses a supervised algorithm where the binary task is solved with a support
vector machine (SVM). Contrary to XU, the SVM does not use a tree kernel but instead
relies on two kinds of light NLP features: bag-of-words (BOW) and part-of-speech
sequences. These are collected for three different positions: before, after, and between
the entities of the relation candidate. BOW is simply a feature where all the tokens found
in the range of the feature are represented, ignoring the order of the tokens. A
part-of-speech sequence stores the full sequence of part-of-speech tags for the feature
and hence takes order into account. This follows the work of systems such as SONEX and
TextRunner, which both rely on BOW and part-of-speech tags and sequences as features.
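
The two feature kinds and the three positions can be sketched as follows. This is an
illustration under stated assumptions rather than the Relex feature extractor: the
function name and feature keys are hypothetical, each region is assumed to extend to the
sentence boundary, and tokens are lowercased for the bag-of-words, none of which is
specified in the text. Turning these features into SVM input vectors is left out.

```python
from typing import Dict, List, Tuple


def candidate_features(
    tokens: List[str],
    pos_tags: List[str],
    first: Tuple[int, int],
    second: Tuple[int, int],
) -> Dict[str, object]:
    """Collect BOW and POS-sequence features before, between, and after two entities.

    first and second are (start, end) token spans with exclusive end; first
    is assumed to precede second in the sentence.
    """
    regions = {
        "before": (0, first[0]),
        "between": (first[1], second[0]),
        "after": (second[1], len(tokens)),
    }
    features: Dict[str, object] = {}
    for name, (lo, hi) in regions.items():
        # Bag-of-words: an unordered set of the tokens in the region.
        features[f"bow_{name}"] = frozenset(t.lower() for t in tokens[lo:hi])
        # POS sequence: the ordered tags of the same region.
        features[f"pos_{name}"] = tuple(pos_tags[lo:hi])
    return features
```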

3.2.4 Triple task

The triple task decides on how to label a relation using explicit words occurring in the
context of the relation. The Relex component solving the triple task is referred to as
the relation phrase extractor. The relation phrase extractor is a rule based component
operating on part-of-speech tags between the entities in a relation. From the sequence
of tokens in this context the last verb is selected as the relation phrase. If no verb is
available it instead selects the last noun. If neither occurs it simply uses the last
token in the sequence between the entities. Since the component relies on heuristics it
can be considered unsupervised. The behaviour is similar to systems such as ReVerb which,
instead of locating a relation phrase, uses heuristics to locate the entities referred to
by a relation phrase. XU also uses heuristics to elicit a relation phrase but then goes on
to classify this relation phrase as true or false.
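
The fallback order of the relation phrase extractor can be sketched as below. The sketch
assumes Penn Treebank style tags, where verb tags start with "VB" and noun tags with "NN".
The function name is hypothetical, and returning None when no tokens lie between the
entities is an assumption, since the text does not cover that case.

```python
from typing import List, Optional


def extract_relation_phrase(tokens: List[str], pos_tags: List[str]) -> Optional[str]:
    """Pick a relation phrase from the tokens between two entities.

    Order of preference: last verb, then last noun, then last token.
    """
    if not tokens:
        return None  # assumption: adjacent entities yield no relation phrase
    for prefix in ("VB", "NN"):  # verbs first, nouns second
        for token, tag in zip(reversed(tokens), reversed(pos_tags)):
            if tag.startswith(prefix):
                return token
    return tokens[-1]
```

Because the rule depends entirely on the part-of-speech tags, a mislabeled token between
the entities directly changes which word is selected.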

4 Methodology

In order to provide support for a discussion and to arrive at an answer to our research
question we will conduct an experiment. The experiment is centered around a relation
extraction system, Relex, as introduced in Section 3.2. Relex is inspired by state-of-
the-art relation extraction systems and will be used with models trained on well formed
English sentences. The idea is to put Relex to work on git commit messages fetched
from GitHub and evaluate the results both quantitatively and qualitatively.

4.1 Data

Two datasets are used for the experiments in this thesis. One is used to elicit training
data for the binary classifier component of Relex. Another one is used to elicit GitHub
commit messages which will later be used as actual test data.

4.1.1 Training data

Based on a set of relation candidates Relex performs the binary task, i.e. binary
classification of the relation candidates as true or false. Models for the SVM component
used in Relex to perform the binary task (Section 3.2.3) are generated by feeding the
system with labeled training data. The training data is derived from the PropBank-style
predicate-argument annotations provided in the training and testing data for the ConLL
2011 shared task [29]. The data, originally used for coreference resolution, consists of
syntactically well structured English sentences from newswire sources such as news
articles, broadcast conversations, magazine articles, and web data. In addition to
predicate-argument and coreference annotations the data also includes annotations such as
part-of-speech, word sense, and named entities. We refer to this data as the ConLL data.
An excerpt from a sentence as it appears in the ConLL data can be found in Appendix C. The
ConLL data is divided into a big set intended for training and a much smaller set intended
for testing.


The training data is comprised of 44,687 sentences while the test data is comprised of
5,814 sentences.

From the ConLL training set three different sets are extracted, meant for training three
different models for the binary classifier Relex uses1. The three sets differ mainly in
what types of entities relations are allowed between. The first dataset only allows
relations where both the first and second arguments are named entities; this set is
henceforth referred to as NE-NE (Named Entity - Named Entity). The second dataset allows
one of the arguments, first or second, to be a noun phrase; this set is henceforth
referred to as NE-NP (Named Entity - Noun Phrase). The third and final dataset allows
relations between any two nouns or noun phrases and is henceforth referred to as NP-NP
(Noun Phrase - Noun Phrase).

Apart from entity restrictions the datasets also differ in how relations are elicited
for them. The relations for the NE-NE set are found by looking for named entities
as tagged by the ConLL annotators and extracting all pairs as relation candidates. If a
relation is found in the PropBank predicate-argument annotations the relation candidate
is considered a true relation, representing a positive data point. All the pairs that are
not connected by a predicate end up as false relations representing negative data points.
Using this technique we end up with a training set of 6,677 data points with a distribution
over negative and positive as seen in Table 4.1.

In the NE-NP and NP-NP datasets the negative data points are collected by discriminating
based on a dependency tree generated by MaltParser [26]. As a first discriminator,
relation candidates are extracted from the PropBank predicate-argument annotations.
Next, the distance in the dependency tree between the first and second entity is
calculated. If the distance is greater than three we label the candidate as false, if it
is less than three as true, and if it is precisely three we leave it out entirely.
Additionally, in order to be considered a true relation, the arguments (noun phrases or
named entities) must follow the predicate pattern with a subject and an object as elicited
by the dependency tree. Candidates with any other labels are rejected as false relations.
This way only explicit binary relations are elicited and nested relations and subclauses
are left out. This method yields an NE-NP dataset with 3,692 training points and an NP-NP
dataset with 10,131 training points, with a distribution over negative and positive as
seen in Table 4.1.
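
The distance-based labelling rule can be sketched as follows. This is an illustration
only: the dependency tree is represented as a hypothetical list of head indices (with -1
for the root) instead of MaltParser output, each entity is assumed to be represented by a
single token, and the order in which the distance and subject/object checks are applied
is an assumption.

```python
from typing import List, Optional


def tree_distance(heads: List[int], a: int, b: int) -> int:
    """Number of edges between tokens a and b in a dependency tree."""
    def path_to_root(node: int) -> List[int]:
        path = [node]
        while heads[node] != -1:
            node = heads[node]
            path.append(node)
        return path

    depth_from_a = {node: d for d, node in enumerate(path_to_root(a))}
    for depth_from_b, node in enumerate(path_to_root(b)):
        if node in depth_from_a:  # lowest common ancestor reached
            return depth_from_a[node] + depth_from_b
    raise ValueError("tokens do not belong to the same tree")


def label_candidate(distance: int, subject_object_pattern: bool) -> Optional[bool]:
    """Return True or False as the training label, or None if the candidate is dropped."""
    if distance == 3:
        return None   # borderline candidates are left out entirely
    if distance > 3:
        return False  # too far apart in the tree
    return subject_object_pattern  # close enough: true only for subject/object pairs
```

For "Alice fixed the bug" with heads [1, -1, 3, 1], the distance between "Alice" and
"bug" is two edges, so the pair is labeled true if it also matches the subject/object
pattern.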

         Positive   Negative
NE-NE        1811       4866
NE-NP        1278       2414
NP-NP        3811       6320

Table 4.1: Training data distribution over different sets

1Corresponding sets are extracted the same way from the ConLL test sets and used for testing Relex
performance on both the binary and triple task as seen in Appendix D and E respectively


The reason for the less strict extraction rules for the NE-NE set is that it would
otherwise turn out very small. The opposite holds for the NP-NP set: with the NE-NE
procedure it would become very large and require a lot of processing power and memory. A
total of 120,000 relation candidates (11,171 positive and 99,736 negative) would be the
result.

Having three different sets of training data constructed using different methods enables
us to experiment with a wider range of natural language techniques in order to find the
most promising one for our purposes. The NE-NE dataset contains signals from relations
with named entities specifically. The NP-NP set is more likely to contain directly binary
relations and the NE-NP set is a mix of the two. The model produced from each dataset
will be referred to as NE-NE, NE-NP, or NP-NP, corresponding to the dataset used to train
it. Appendix D and E show results from testing Relex on the smaller ConLL test dataset for
the binary and triple task respectively. This serves as validation that Relex is at least
a decent system for relation extraction on well structured English.

4.1.2 Test data

The git commit data is fetched from GHTorrent [15], a data collection initiative that
extracts data directly from GitHub. The initiative originated from a need in a research
project in 2012 and was later extended to a complete and public repository. Today the
complete GHTorrent contains information from the majority of events on GitHub from
2012 through to 2014, and continuous backtracking is made to include earlier data.
Examples of the data stored include commit messages, commit comments, issues, and issue
comments. GHTorrent’s data collection is limited to publicly available repositories and
hence repositories in private mode are not represented.

Due to the size of the complete dataset (3.5 TB) this work has been performed on one of
its smaller subsets, namely the “MSR 2014 Mining Challenge Dataset”. This subset has the
same structure as the full set and contains all data from the top-10 most starred
repositories, including all their forks, from the nine most popular programming languages
on GitHub. The full list of projects included in the dataset can be found in Appendix F.
The resulting dataset contains 601,080 commits with corresponding commit messages. From
these commits we randomly2 selected and annotated 600 commit messages. These annotations
with their underlying data are used as test data in our experiments and represent GitHub
commit messages. We refer to this data as the GHTorrent data.

Each commit message was annotated with entities and relations. As evident by the
background (Sections 3.1.1 and 3.1.2) different systems employ different definitions of
what constitutes an entity, be it a noun phrase or some other set of categories for named
entities. For the application studied in this thesis useful entities would be connected to
the domain, e.g. names of methods, classes, issues, and files. We propose and use a list
of entity types between which useful relations could exist (see Appendix A).

2We used atmospheric noise from http://www.random.org/ to generate random numbers [3].

Two different relation types were considered in the GHTorrent data: entity relations
and author relations. Entity relations are those fulfilling the normal case of relation
extraction, where a relation is found to exist between two named entities in the text.
Entity relations are fully explicit, which follows the delimitations we discuss in Section
4.3 below. A quick visual inspection of the data also revealed a lot of relations between
an implicit author entity and an explicit entity in the text. Below follow three examples
of this type of relation found in the actual GHTorrent test data.

The first example is where the author has done something to something, e.g. changed a
file:

“Update CHANGELOG to mention the json escape change”

In this case there is a relation labeled “Update” between an implicit author entity and
the “CHANGELOG”. A second example is where the author has solved an issue:

“Fixes #1097.”

Here the author has fixed the issue named “#1097”. Note that the two examples
above could also refer to the commit itself rather than the author. We allow for this
variation of what the implicit author entity may refer to as long as we find an explicitly
expressed relation between the implicit and the explicit entity. A third example is where
the author has merged another branch with the current one:

“Merge branch ’master’ of github.com:akka/akka”

In the above commit message there is a relation between an implicit author entity and
the branch entity “master”. We choose to label the relation types described here as
author relations. There is normally no explicit author entity in the commit message and
our system handles only relations between entities which are found in the same sentence.
Therefore, during relation candidate identification, we add a mock author entity at the
beginning of each sentence and let author relation candidates be those between this
entity and other entities found in the sentence.
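
The mock author entity can be illustrated with a small sketch. The placeholder token and
the function name are hypothetical and entities are reduced to plain strings; the sketch
only shows that every entity in a sentence is paired with one implicit author entity
placed at the start of that sentence.

```python
from typing import List, Tuple

AUTHOR = "<AUTHOR>"  # hypothetical placeholder for the implicit author entity


def author_relation_candidates(
    sentence_entities: List[List[str]],
) -> List[Tuple[str, str]]:
    """Pair the mock author entity with every entity, sentence by sentence.

    sentence_entities holds one list of entity strings per sentence.
    """
    candidates = []
    for entities in sentence_entities:
        # One mock author per sentence, paired with each entity found in it.
        candidates.extend((AUTHOR, entity) for entity in entities)
    return candidates
```

For the first two examples above, a sentence containing only “CHANGELOG” and one
containing only “#1097” each yield a single author relation candidate.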

Annotation was done by the author of this thesis alone. Each document was annotated
in two steps. The first step entailed elicitation of all entities using a list of allowed
entity types. See Appendix A for the full list of entity types. In the second step the
entities from the first step were considered for relations with each other, each with a
connecting relation phrase between them. A list of rules and guidelines was followed in
an effort to construct a dataset of consistent annotations in line with the delimitations
of the study. The full list of guidelines can be found in Appendix B. The procedure
with two passes to annotate entities and relations separately mimics the behaviour of
the relation candidate identification used in Relex. It also ensures that no entity is
left out because it is not part of a relation tuple. Leaving such entities out would give
Relex an unfair advantage since fewer negative relation candidates would be found. For
annotation purposes we used the brat annotation tool [33]. The full annotation project is
available at https://github.com/Rikard-Andersson/GHTorrent-Brat.


4.2 Experiments

Relation extraction as described in Section 3.1 pre-processes raw text by attaching the
annotations needed for the extraction process. The extraction process itself can then be
divided into relation candidate identification, the binary task, and the triple task.
Relation candidate identification finds possible instances of relations in the
pre-processed text and marks them as relation candidates. The binary task is then to label
relation candidates as true or false, and the triple task takes care of finding a relation
phrase for the instances labeled as true. Relex, described in Section 3.2, is structured
in this exact way. Two categories of experiments were conducted on the GHTorrent
data, one testing the Relex component solving the binary task and one testing the Relex
component solving the triple task.

4.2.1 Binary task

For each model (NE-NE, NE-NP, and NP-NP) the binary task was tested for author
relations and entity relations separately. Thus, a total of six tests were run, two for
each model. The GHTorrent test data was pre-processed using the same pre-processing
used in Relex with tags added for sentences, tokens, and part-of-speech. The entity
annotations were derived from the handmade annotations and the relation candidate
identification was done using the Relex component for this task.

Authors in the field of relation extraction, as with many other machine learning prob-
lems, use recall, precision, and f-measure to evaluate performance of their algorithms,
methods, and systems [1]. The same metrics will be used to measure performance on
the binary task in this work as well.

4.2.2 Triple task

Both entity relations and author relations were tested for relation phrase extraction
and an additional aggregate performance was calculated. The triple task is sequentially
dependent on the binary task which may mislabel false relations as true. As a result the
relation phrase extractor might in a live environment attempt to find relation phrases
between entities that have no relation between them. This type of faulty behaviour
should be attributed to the performance of the binary classifier and will thus not be
accounted for in the evaluation of the triple task. Therefore only the true relations
from the annotations will be used in these tests. Thus, the triple task component of
Relex, also called the relation phrase extractor, will be tested separately from the rest
of the system. The true relations are passed through the relation phrase extractor and
evaluated as correctly elicited only if the complete phrase is identical to the one annotated
by hand.

When evaluating the triple task there are no fixed classes to assign the answers to,
and presenting recall and precision becomes redundant. This is because every relation
candidate has exactly one true relation phrase and the algorithm extracts exactly one
relation phrase per relation candidate. If the relation phrase extracted by the algorithm
is wrong this will register as a false positive while the true relation phrase will
register as a false negative. This symmetry results in precision equaling recall. Instead
the performance of the relation phrase extractor is measured by accuracy, i.e. how often
it finds the correct relation phrase, which is incidentally the same as both recall and
precision in this case.
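
The symmetry can be written out explicitly. The symbols are introduced here for
illustration only: N is the number of true relations passed to the relation phrase
extractor and c is the number of relation phrases it extracts correctly.

```latex
\[
\mathrm{TP} = c, \qquad \mathrm{FP} = \mathrm{FN} = N - c
\]
\[
\text{precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}} = \frac{c}{N}
= \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} = \text{recall},
\qquad \text{accuracy} = \frac{c}{N}
\]
```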

4.3 Delimitations

The delimitations of the experiments conducted in this thesis are implied by the de-
limitations of the system used to perform them, Relex. Relex has in turn inherited its
limitations from the systems in the relation extraction domain that inspired it.

An implicit relation can be exemplified by an instance where entity A has an explicit
parent relation to both entity B and entity C, as illustrated in Figure 4.1. There is
hence an implicit sibling relation between B and C [12]. As is the case with most of
the systems described in the background (Section 3.1), this study is limited to only
explicit relations. Moreover, this study is limited to relations where the explicit phrase
expressing the relation is located between the entities in the relation candidate. This
delimitation makes for a less complex task and leaves a more manageable scope.

Figure 4.1: Explicit parent relations and implicit sibling relations

Binary relations are those where the relation only includes two entities. N-ary relations,
in contrast, extend beyond two entities and can include several other entities with
intricate relations among them. Semantic role labeling can produce n-ary relations where
the predicate-argument structure may include several entities. However, when these are
used in a relation extraction system they are often converted to binary relations and any
more elaborate structures are discarded [9, 14]. This study is limited to binary
relations.

All data and models used in Relex and in the experiments are for text data expressed
in English. Relex could of course be trained and tested with data from other languages
but the question is how relevant this would be in a software engineering context assuming
that most software engineering activities are conducted in English.

What constitutes an entity varies across different systems. Delimiting in this dimension
comes down to what is useful for the problem at hand. See Section 4.1.2 for a detailed
description of how the issue is handled in this thesis. The task of entity recognition or
extraction is left out of the experiments; entities are instead derived from hand
annotations.


The delimitations mentioned in this section are implied for the entire study and
extend to both datasets as well as the implementation of Relex. In short, the study is
delimited to semantically explicit, binary relations between entities expressed in
English text.

5 Results

In order to evaluate how well relation extraction can be applied to git commit data
experiments have been conducted using the relation extractor Relex. The data used
for the experiments were fetched from public projects on GitHub and annotated with
entities and relations between these. The tests were conducted with each of the two
main components of Relex separately. Each component deals with a specific task: the
binary task and the triple task.

5.1 Binary task

The binary task is a classification task where a relation candidate is labeled as true or
false, representing whether the relation candidate indeed contains a relation or not. The
results are divided by the relation type being evaluated. Recall, precision and F-measure
are calculated. Recall measures the proportion of correctly predicted true relations to
all actual true relations. Precision measures the proportion of correctly predicted true
relations to all predicted true relations. F-measure is the harmonic mean of recall and
precision. For each relation type the performance of the three models NE-NE, NE-NP, and
NP-NP is presented separately.
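
As a worked example of these definitions, the sketch below recomputes the three metrics
from the confusion matrix counts reported for the NE-NE model on entity relations in
Table 5.1. The function name is hypothetical and the sketch assumes non-zero denominators.

```python
from typing import Tuple


def binary_metrics(tp: int, fn: int, fp: int) -> Tuple[float, float, float]:
    """Return (recall, precision, F-measure) from confusion matrix counts."""
    recall = tp / (tp + fn)        # share of actual true relations that were found
    precision = tp / (tp + fp)     # share of predicted true relations that are correct
    f_measure = 2 * precision * recall / (precision + recall)  # harmonic mean
    return recall, precision, f_measure


# Counts from Table 5.1: 56 true positives, 10 false negatives, 341 false positives.
recall, precision, f_measure = binary_metrics(tp=56, fn=10, fp=341)
print(f"{recall:.3f} {precision:.3f} {f_measure:.3f}")  # 0.848 0.141 0.242
```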

5.1.1 Entity relations

Entity relations describe relations between two explicit entities in text with an explicit
relation phrase binding them together. From the 600 commit messages Relex found
678 entity relation candidates using the relation candidate identification component. 66
of these are true relations (by annotations) and the remaining 612 are hence false.

From Tables 5.1, 5.2, and 5.3 it is evident that the NE-NE model is best able to find
true relations (recall of 85 %). Compared to the other two models the NE-NE model is more
generous in assigning true relations and predicts well over half of the relation
candidates to be actual relations, while the other two are more restrictive and predict
about 15-20 % of the relation candidates to be true relations. As a result the NE-NE model
produces a slightly lower precision, with only about 14 % of the true predictions being
actually true. Overall, precision scores are low, with the NP-NP model scoring the highest
with every fifth extracted relation being a true relation. On the whole, entity relations
extracted by Relex with any of the three models are more likely not to be true relations
by a wide margin.

                  Predicted
                True   False   Total
Actual  True      56      10      66
        False    341     271     612
        Total    397     281

Recall 0.848   Precision 0.141   F-measure 0.242

Table 5.1: Performance of Relex on the binary task using the NE-NE model on entity
relations from the GHTorrent data

                  Predicted
                True   False   Total
Actual  True      18      48      66
        False     90     522     612
        Total    108     570

Recall 0.273   Precision 0.167   F-measure 0.207

Table 5.2: Performance of Relex on the binary task using the NE-NP model on entity
relations from the GHTorrent data

                  Predicted
                True   False   Total
Actual  True      27      39      66
        False    105     507     612
        Total    132     546

Recall 0.409   Precision 0.205   F-measure 0.273

Table 5.3: Performance of Relex on the binary task using the NP-NP model on entity
relations from the GHTorrent data

5.1.2 Author relations

Author relations express a relation between an implicit author or commit entity and an
explicit entity with an explicit relation phrase between them. From the 600 commit
messages Relex found 1,118 author relation candidates, 338 of which are true relations (by
annotations) and the remaining 780 false.

                  Predicted
                True   False   Total
Actual  True     188     150     338
        False    589     191     780
        Total    777     341

Recall 0.556   Precision 0.242   F-measure 0.337

Table 5.4: Performance of Relex on the binary task using the NE-NE model on author
relations from the GHTorrent data

                  Predicted
                True   False   Total
Actual  True     160     178     338
        False    210     570     780
        Total    370     748

Recall 0.473   Precision 0.432   F-measure 0.452

Table 5.5: Performance of Relex on the binary task using the NE-NP model on author
relations from the GHTorrent data

                  Predicted
                True   False   Total
Actual  True     237     101     338
        False    230     550     780
        Total    467     651

Recall 0.701   Precision 0.507   F-measure 0.589

Table 5.6: Performance of Relex on the binary task using the NP-NP model on author
relations from the GHTorrent data

Compared to the performance on entity relations detailed in Section 5.1.1, Relex
produces a better overall result on author relations using any of the models. From the
performance presented in Tables 5.4, 5.5, and 5.6 it is clear that the NE-NE model
is once again the most generous in assigning true relations to candidates, doing so rather
unsuccessfully with a precision of roughly 24 %. It does however manage to find 56 %
of the positive relations, which is better than the NE-NP model with 47 %. The NP-NP
model does however outperform both of the other two models quite clearly with a recall
of 70 %, a precision of 51 %, and an F-measure of 59 %, making it the best result on
the binary task. The relations it assigns as true are however just barely more likely to
actually be true than false. Hence, in the best case scenario, approximately half of the
relations extracted by Relex would be inaccurate.

As a benchmark the results for both entity relations and author relations can be
compared to those of applying Relex to solve the binary task for the ConLL test data.
These results can be seen in Appendix D.

5.2 Triple task

The triple task is where the relation phrases of actual relations are to be extracted.
Relex does this with a simple rule based annotator that works exceptionally well on the
ConLL data (see Appendix E). The performance on the triple task is measured in accuracy
and the results on the GHTorrent data can be seen in Table 5.7.

Relation type             Accuracy
Entity relations          0.515
Author relations          0.675
Both kinds of relations   0.649

Table 5.7: Relex triple task accuracy on different relation types in the GHTorrent data
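
The aggregate figure in Table 5.7 is consistent with pooling the two relation types. The
sketch below assumes, as an interpretation not stated explicitly in the text, that the
aggregate is computed over the 66 true entity relations and the 338 true author relations
reported in Section 5.1; the function name is hypothetical.

```python
from typing import Iterable, Tuple


def pooled_accuracy(groups: Iterable[Tuple[int, float]]) -> float:
    """Accuracy over several groups, weighted by the number of relations in each.

    groups holds (number_of_relations, accuracy) pairs.
    """
    groups = list(groups)
    total = sum(n for n, _ in groups)
    return sum(n * accuracy for n, accuracy in groups) / total


# 66 entity relations at 0.515 accuracy and 338 author relations at 0.675.
print(round(pooled_accuracy([(66, 0.515), (338, 0.675)]), 3))  # 0.649
```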

The relation phrase extractor is correct in about two of three cases for the author
relations and about every other case for the entity relations. Just as in the binary task,
Relex struggles more with the entity relations than with the author relations despite the
fact that the author relations are a unique case constructed for the GHTorrent data. This
indicates that entity relations are indeed expressed differently in the GHTorrent data
than they are in the ConLL data, which is considered to consist of well formed English
sentences. There is a significant drop from the 92-99 % accuracy of the relation phrase
extractor on the ConLL data (see Appendix E).

5.3 Qualitative results of pre-processing

The pre-processing used to run Relex poses less of a problem on well formed English
sentences. In git commit messages however, the syntax and grammar are not of the same
standard and the vocabulary is different. It is therefore necessary to qualitatively look
at the performance of pre-processing. This provides more material to use for analysis of
performance of the binary and triple task components of Relex on GHTorrent data. It
should be noted that in some cases in the GHTorrent data, even a human will have a
hard time labeling the data due to the specialized way of language use in the commit
messages.


5.3.1 Sentence detector

Here follow some qualitative results of the sentence detector. Commit message authors
often logically split their statements by new lines, which work as sentence delimiters.
Names with periods do however sometimes present a problem, as in these cases where
an actual sentence is annotated as two:

“depends: update .gitignore for depends for osx and ios builds”

is annotated as two sentences: “depends: update .gitignore” and “for depends for osx and
ios builds”.

“Fix check for browser.mozilla so that Safari is not flagged as mozilla”

is annotated as two sentences: “Fix check for browser.mozilla” and “so that Safari is not
flagged as mozilla”.

“Lua: take 5.1.4-3 patch from MacPorts”

is annotated as two sentences: “Lua: take 5.1.4-3” and “patch from MacPorts”.

In the annotated GHTorrent data no case occurs where a sentence is split such that the
relation candidate identifier is unable to pair entities into a relation candidate where
there is an actual relation. However, wrongly detected sentences like those above are a
possible source of future errors.

The opposite case, where two sentences are interpreted as one, also occurs. The
two lines:

“Previous init will not work with Fedora15+
Fedora14 has been EOF 2011-12-08”

are interpreted as one sentence while it would be more appropriate to divide them into two
sentences. This creates a relation candidate between entities that would not originally
have a relation between them, which is a source of possible errors for Relex.
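
The observation that authors use new lines as sentence delimiters can be illustrated with
a minimal newline-based splitter. This is not the OpenNLP sentence detector used in Relex
and was not part of the experiments; the function name is hypothetical and the sketch only
shows how the examples above would be segmented if line breaks were the sole delimiter.

```python
from typing import List


def split_on_newlines(message: str) -> List[str]:
    """Treat every non-empty line of a commit message as one sentence."""
    return [line.strip() for line in message.splitlines() if line.strip()]
```

Under this rule the two Fedora lines become two sentences, while a single-line message
such as the “.gitignore” example is kept whole regardless of the periods it contains.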

5.3.2 Part-of-speech tagger

The part-of-speech tagging is another point where the pre-processing might cause
errors in downstream classification steps. Consider the following four excerpts from
commit messages:

1. “This commit gets rid off code wrapping that was previously used by...”

2. “SI-7331 tb.parse returns unpositioned trees”

3. “Instead combination of templateStats/accept(EOF) is used.”

4. “...but it was forgetting to reset ContainsLDynamicVariable in case eval is used.”


Sentence   Word                          OpenNLP POS   True POS
1          “commit”                      verb          noun
2          “tb.parse”                    adjective     noun
3          “templateStats/accept(EOF)”   adjective     noun
4          “ContainsLDynamicVariable”    adjective     noun

Table 5.8: Examples of part-of-speech (POS) mislabelings by OpenNLP part-of-speech
tagger on GHTorrent data

Each one of these sentences has tokens which are wrongly annotated by the part-of-speech
tagger used in Relex. The corresponding errors are listed in Table 5.8.

All of the cases involve highly specialized words from the domain from which the
GHTorrent data is derived. With a perfect system all of these would be labeled as nouns,
but clearly they are not by the part-of-speech tagger used in Relex. Mislabeling also
happens for other tags:

“This pull request doesn’t attempt to generalize this approach in any way and
re-use it all over the place due to the caution of possible accidental compati-
bility breakage.”

In this case “re-use” is labeled a noun where it should be labeled a verb.

Comparing the results observed in this chapter with the ones for the ConLL test data
(see Appendices D and E), it is clear that Relex performs worse when applied to the
GHTorrent data than when applied to ConLL data. This is true independent of which type of
model is used and for both the binary and triple task. Drilling further down, vital
processing components such as the sentence detector and the part-of-speech tagger both
make consistent mistakes on the GHTorrent data.

6 Discussion

The relation extractor Relex has been applied to extract relations from git commit
messages, with severe drops in performance compared to other benchmarks. In order to draw
conclusions from the results and generalize them to the field of relation extraction and
the applicability of such techniques on git commit data we need to analyse the results
further. Relex is a system inspired by the frontier of the relation extraction research
field and proven to work on well structured English sentences. Moreover, the system is
based on very simple features derived from light NLP techniques. As such Relex can be
considered both a relative of the systems produced through relation extraction research
and a common denominator of them. Any shortcomings of Relex rooted in these basic concepts
will therefore be discussed as extending to the entire family of relation extraction
systems on which Relex is based.

6.1 Analysis of results

The additional data presented and analysed here, expressed as excerpts from git commit
messages, is drawn from processing done with the NP-NP model. This is because this model
produced the best results and shows the most promise of being part of a functional
solution to the problem discussed in Section 1.

6.1.1 Sentences

From the examples of sentence detection performance presented in Section 5.3.1 it is
evident that the sentence detector sometimes fails due to the way sentences are
expressed in the domain. One big problem seems to be that punctuation and capitalization
are used differently from well-formed English sentences. Dots occurring in entities such
as file or method names cause the sentence detector to break a sentence. Another
example is the following message, which is detected as one full sentence while the author
most likely intended several: one for each of the three semicolon-separated clauses and
possibly a separate one for “[DB MIGRATION REQ’D]”, four in total.

“[DB MIGRATION REQ’D] Added plugin database tables and DAO; Added
network column to relevant tables; Moved work of fetching active plugins to
common/init.php”

Relex relies on sentence detection for its relation candidate identification pipe. It uses
the sentence annotations to limit relation candidates to entity pairs appearing within
the same sentence, but also to find a suitable mock entity for the author relations. If the
sentence detector wrongly splits a sentence, relation candidates might be overlooked. This
cannot, however, have affected the outcome of the experiments in this study, since true
relations are elicited from annotations. If the sentence detector instead fails to make a
split between two sentences, additional false relation candidates are identified. This
introduces unnecessary instability as these cross-sentence relation candidates are evaluated
and possibly falsely labeled as true. Moreover, the author mock entity will be placed in the
wrong context and the wrong features are extracted for classification. Both these cases
cause one error each in the above sentence: “[DB MIGRATION REQ’D]” is classified as
having a true relation with “network column”, and the author relation with “DAO” and
relation phrase “Added” is labeled as false.
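
The effect of a missed sentence split on the number of relation candidates can be made concrete with a small sketch. It assumes, as described above, that entity relation candidates are pairs of entities within one sentence and that every entity also yields one author candidate. The function name, the `AUTHOR` placeholder and the entity lists are hypothetical and chosen for illustration only; they are not taken from Relex or from the annotations.

```python
from itertools import combinations

AUTHOR = "<author>"  # hypothetical placeholder for the author mock entity


def relation_candidates(sentences):
    """Enumerate candidates from a list of sentences, each given as a list of entities."""
    candidates = []
    for entities in sentences:
        # Entity relation candidates are restricted to pairs within one sentence.
        candidates.extend(combinations(entities, 2))
        # Every entity also yields one author relation candidate.
        candidates.extend((AUTHOR, entity) for entity in entities)
    return candidates


# Illustrative entity lists for the commit message above (not the real annotations).
merged = [["[DB MIGRATION REQ'D]", "plugin database tables", "DAO",
           "network column", "common/init.php"]]
split = [["[DB MIGRATION REQ'D]"],
         ["plugin database tables", "DAO"],
         ["network column"],
         ["common/init.php"]]

print(len(relation_candidates(merged)))  # 15
print(len(relation_candidates(split)))   # 6
```

With the same five entities, treating the message as one sentence produces 15 candidates, while the four-sentence reading produces 6. The nine extra candidates are all cross-sentence entity pairs, each of which is an opportunity for a false positive.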

Other cases where the sentence detector used in Relex fails include the following
sentence.

“extracted the built in profiling out added pp=profile-gc-time”

It is obviously hard for a sentence detector trained on English to label this sample as
anything other than one sentence, while it should perhaps be two: “Extracted the built in
profiling out.” and “Added pp=profile-gc-time.”. The author provides no signal to indicate
that this is the case, which illustrates a point that will recur in this analysis: Git commit
messages are not well-formed sentences, and thus any machine learning algorithm trained
on such messages will have to be very noise tolerant to function well.
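
As an illustration of what domain-aware sentence splitting could look like, the sketch below splits a commit message after semicolons, after a closing bracket followed by a capital letter, and after sentence-final punctuation followed by a capital letter, while leaving dots inside tokens such as file names untouched. The function name and the rules are assumptions made for illustration; they are not part of Relex or of the sentence detector used in this study.

```python
import re

# Split after ';', or after '.', '!', '?' or ']' when followed by whitespace and an
# uppercase letter. A dot inside a token such as "init.php" is not followed by
# whitespace, so it never triggers a split.
_BOUNDARY = re.compile(r"(?<=;)\s+|(?<=[.!?\]])\s+(?=[A-Z])")


def split_commit_message(message):
    """Split a commit message into sentence-like segments (heuristic sketch)."""
    segments = []
    for part in _BOUNDARY.split(message):
        # Drop the separator itself and any surrounding whitespace.
        part = part.strip().rstrip(";").strip()
        if part:
            segments.append(part)
    return segments


message = ("[DB MIGRATION REQ'D] Added plugin database tables and DAO; "
           "Added network column to relevant tables; "
           "Moved work of fetching active plugins to common/init.php")
for segment in split_commit_message(message):
    print(segment)
```

Applied to the first example in this section, the sketch yields four segments. Applied to “extracted the built in profiling out added pp=profile-gc-time” it still returns a single segment, since the message contains no signal to split on.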

6.1.2 Part-of-speech

The part-of-speech tagger is another component of Relex that mislabels the GHTorrent
data. Both the binary classifier and the relation phrase extractor are dependent on
part-of-speech tags. The binary classifier uses them as features and the relation phrase
extractor finds tokens that match part-of-speech tags in a scheme of rules. As seen in
Table 5.8, the part-of-speech tagger has problems where domain-specific language is used.
Another example of this can be found in the commit message below, in which a true
entity relation can be found between “postgresql” and “postgres-xc”.

“postgresql conflicts with postgres-xc”

This is arguably a rather simple relation and should hence be easy to detect, but Relex
classifies it as false. The word “conflicts” is here labeled with a noun part-of-speech tag
while verb would be more correct. Moreover, both “postgresql” and “postgres-xc” are labeled
with an adjective part-of-speech tag while they are in fact nouns. Further complicating
the classification is the fact that no tokens exist before or after the entities, which would
presumably not happen in the newswire-based training data.

Relex performance on the triple task also suffers due to mislabeled part-of-speech
tags. This is only natural since the relation phrase extractor relies exclusively on
part-of-speech tags to work. In the example above the faulty part-of-speech tags do
not pose a problem since the relation phrase extractor gets it right anyway, one could
argue luckily so. In other cases, however, tokens with wrongly assigned part-of-speech
tags do cause problems:

“cluster: Rename destroy() to kill(signal=SIGTERM)”

The token “Rename” is labeled as an adjective instead of a verb and “cluster” is instead
selected as relation phrase in the author relation with “destroy()”. In this case the
sentence detector could also be to blame for including “cluster”.

“Merge pull request #370 from Memphiz/airplay”

In this case, which we will see later is a recurring one, “pull” is labeled a verb and “Merge”
a noun. The token “pull” being a verb is nothing strange in a normal vocabulary, but
“pull request” in this domain is a noun phrase while “Merge” is actually a verb.

The cases above and the ones presented in Table 5.8 in the results (Section 5.3.2) can
all be attributed to words specific to the domain. These are hard for an automatic annotator
to handle if it is not trained on the vocabulary in question. Part-of-speech tags and
sequences of these represent half of the features for the binary classifier and all of the
features for the relation phrase extractor. When this pre-processing step fails, any
processing that depends on it will falter as a result.

6.1.3 Recurring patterns

There are patterns in the GHTorrent data that are hard to interpret, both for an
unknowing machine and for any human not directly familiar with what caused the pattern.
The pattern below (counted by the sentence detector as one whole sentence) occurs
multiple times, sometimes with other branch names (in this case “trunk”):

“--HG--
branch : trunk”

The branch names have been annotated as entities in these cases and in each case an
author relation is wrongly labeled as true by Relex. Below are two cases of another
pattern that recurs several times in the test data.

“svn path=/trunk/mono/; revision=111467”

“svn path=/trunk/mcs/; revision=91662”

In these cases two entities are annotated, one for the path (“/trunk/mono/” and “/trunk/mcs/”)
and one for the revision id (“111467” and “91662”). In each case where this pattern occurs,
an author relation is wrongly labeled as true for the revision id.

“git-svn-id: http://svn-commit.rubyonrails.org/rails/trunk@5893 5ecf4fe2-1ee6-0310-87b1-e25e094e27de”

The final problematic pattern is the one above: each time it occurs, an author relation
with “http://svn-commit.rubyonrails.org/rails/trunk@5893” is wrongly labeled as true.

Each occurrence of the types of patterns described above lowers the precision, as
Relex wrongly labels non-relations as relations. If such cases can be handled with human
intervention or simple rules to filter them out, that would definitely increase precision.
This is, however, nothing that Relex currently does, and such capabilities would require
careful mapping of the patterns occurring in the data. Moreover, as these are recurring
patterns we assume they either originate from third-party software handling git commits or
from a standard rigorously enforced by a team on a project. With that dependency there is
a risk that heuristics would fail with new or changed standards and with altered or new
third-party software.
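
A minimal sketch of such filter rules is given below, assuming that the three patterns quoted in this section always appear on lines of their own. The regular expressions and the function name are hypothetical and derived only from the excerpts above; a real rule set would need the careful mapping of the data mentioned in the text.

```python
import re

# One hypothetical rule per recurring pattern quoted in this section.
_TOOL_PATTERNS = [
    # "--HG--" marker line followed by "branch : <name>"; also accepts en dashes.
    re.compile(r"^[-–]{1,2}HG[-–]{1,2}[ \t]*\n[ \t]*branch[ \t]*:[ \t]*\S+[ \t]*$",
               re.MULTILINE),
    # "svn path=<path>; revision=<number>"
    re.compile(r"^svn path=\S+;[ \t]*revision=\d+[ \t]*$", re.MULTILINE),
    # "git-svn-id: <url> <36-character repository id>"
    re.compile(r"^git-svn-id:[ \t]+\S+[ \t]+[0-9a-f-]{36}[ \t]*$", re.MULTILINE),
]


def strip_tool_lines(message):
    """Remove lines matching a known tool-generated pattern from a commit message."""
    for pattern in _TOOL_PATTERNS:
        message = pattern.sub("", message)
    return message.strip()


print(repr(strip_tool_lines("svn path=/trunk/mono/; revision=111467")))
print(repr(strip_tool_lines("postgresql conflicts with postgres-xc")))
```

A message that is reduced to the empty string could be skipped before relation candidates are generated, while ordinary messages pass through unchanged. As noted above, rules of this kind are brittle: a changed tool or team convention silently bypasses them.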

It should be mentioned that there are other types of more semantically sound patterns
which are successfully labeled by Relex. Below are three samples of one such pattern.

“Merge branch 'master' of github.com:akka/akka”

“Merge pull request #370 from Memphiz/airplay”

“Merge pull request #848 from SuprDewd/issue 82”

Each of these contains two author relations and one entity relation. The entity relation
and one of the author relations are false relations extracted by the relation candidate
identifier. All of the relations are correctly labeled by Relex, despite the pre-processing
failing. Contrary to the wrongly labeled patterns above, these patterns
are semantically interpretable for a human. When that is not the case, no supervised
relation extractor can be faulted for mislabeling.

6.1.4 Language use

Most of the causes for wrong annotations seen so far can be attributed to the specific
use of syntax and terminology in the GHTorrent data, which differs from how well-formed
English sentences would be written. To extend that argument, there is also a use of
language that differs from that of the ConLL training data but which is
hard to label as faulty English per se. Sentences such as the one below contribute greatly
to low precision even though all the tokens are labeled with the correct part-of-speech tag.

“(mono save args): Use ARGSTORE instead of TEMPSTORE to handle
soft float correctly.”

Here an entity relation has been annotated between “mono save args” and “ARGSTORE”
with the relation phrase “Use”. There are, however, no fewer than four other relations
wrongly classified as true.

1. Entity relation between “ARGSTORE” and “TEMPSTORE”
The relation between “ARGSTORE” and “TEMPSTORE” is relative to the relation
between “mono save args” and “ARGSTORE”. Thus, if n-ary or nested relations
were being elicited this would be a valid relation.

2. Entity relation between “mono save args” and “TEMPSTORE”
There is no explicit binary relation between these two entities. It is interesting to
note, however, that there would be an implicit relation between “mono save args”
and “TEMPSTORE” with a relation phrase of the form “not use”.

3. Author relation with “ARGSTORE”
In this instance we do not have an annotation saying that the author is using
“ARGSTORE” but instead that “mono save args” is.

4. Author relation with “TEMPSTORE”
Similarly to the previous case, in this instance we do not think that the author is
using “TEMPSTORE” but instead that “mono save args” is. Again, if there were
a relation between the author and “TEMPSTORE”, the relation would be implicit
with a relation phrase of the form “not use”.

This is more likely a case of the training data differing greatly from the GHTorrent
domain. The commit message above contributes severely to the low score, while
others, such as the one below, contribute less severely.

“Start login shell (fixes #59 github issue) (per Austin Clements)”

In the sentence above Relex incorrectly finds an entity relation between “login shell” and
“#59”. This case also illustrates the domain-specific language use in the GHTorrent data,
with differing syntax and vocabulary compared to the more standard English language
in the ConLL data.

6.2 Alternatives

So far in our discussion we have pointed to different possible reasons why Relex performs
poorly on the GHTorrent data. It would therefore be interesting to revisit the design
choices and evaluate how other approaches may aid relation extraction on this type of
data.

The main problem has been a poor match between the system and the data. With
the sentence detector, part-of-speech tagger, and binary classifier all failing due to a
mismatch between model and data, the question arises whether the system should use other
models or rely on algorithms that do not depend on models at all.

In the case of pre-processing for annotations such as sentences, tokens, and part-of-speech
tags, there are to our knowledge no models trained specifically for Git commit messages.
The other models we have found are either very specific to their domain or generic to
well-formed English sentences, in which case they are as ill-suited to the GHTorrent
data as the ones used in this study. We therefore assess the probability of finding models
suitable for these purposes as close to zero. Another option is training such models for
the domain. This is not only time consuming and expensive but may also be futile.
The way language use differs between projects, third-party software, and even individuals
may make it impossible for a system to find clear signals in training data. What remains is
training models for specific projects or individuals, which we dismiss based on cost and
time consumption alone.

When it comes to the relation extraction itself, both the binary classifier and the
relation phrase extractor could benefit from models trained on hand-annotated,
domain-specific data. For the relation phrase extractor a machine learning algorithm such as
conditional random fields would be needed. An issue with this approach is that the features
available for machine learning would, to be functional on the domain-specific data, still
need domain-specific models. Another problem is the difficulty for an annotator to comprehend
what the commit message author is trying to convey when using syntax and terminology
not known to the annotator, as is the case when language strays too far from passable
English.

Other options are to use different features and algorithms, preferably ones that
are not sensitive to noisy data. A possibility would be less supervised machine learning
algorithms, although these are generally less successful than supervised methods.
There is a middle ground with semi-supervised, self-supervised, and distantly supervised
alternatives, which have shown previous success for NLP. The last of these would be
promising, considering that a Git repository is essentially a source of structured data
potentially useful for distant supervision. Another option which has gotten much attention
lately would be deep learning. Deep learning, often applied in a largely unsupervised
fashion, finds patterns in raw data without the use of hand-crafted intermediary features.
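
To illustrate how the structured side of a repository could supply labels for distant supervision, the sketch below treats the paths changed by a commit as known facts and marks a file mention in the message as a positive example when it matches a changed path. The `Commit` record, its field names, the matching rule and the example paths are all hypothetical; nothing here is part of Relex or has been evaluated in this study.

```python
import re
from dataclasses import dataclass


@dataclass
class Commit:
    """Hypothetical record; in practice the fields would be read from the repository."""
    message: str
    changed_paths: list


# Rough pattern for file-like tokens such as "common/init.php" (an assumption).
_FILE_TOKEN = re.compile(r"[\w./-]+\.\w+")


def distant_labels(commit):
    """Label each file-like mention by whether the commit actually changed that file."""
    labels = []
    for mention in _FILE_TOKEN.findall(commit.message):
        touched = any(path == mention or path.endswith("/" + mention)
                      for path in commit.changed_paths)
        labels.append((mention, touched))
    return labels


# The changed paths below are invented for the example.
commit = Commit(
    message="Moved work of fetching active plugins to common/init.php",
    changed_paths=["webapp/common/init.php", "webapp/plugins/index.php"],
)
print(distant_labels(commit))  # [('common/init.php', True)]
```

Mentions labeled `True` could serve as automatically generated positive author relations, and mentions labeled `False` as candidates for negatives, without any manual annotation. Whether such labels are clean enough to train on is an open question.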

6.3 Overall results

Even if we focus on the best of the results, the NP-NP model applied to finding author
relations, more of the extracted relations are incorrect than correct.
Looking at the entity relations, the same metric, precision, is well under 20 %. In all
combinations of models and relation types a clear decline in both recall and precision is
evident compared to the corresponding performance on the well-formed English sentences
from the ConLL data. The same kind of decline is present when applying the rule-based
relation phrase extractor, which is very successful on the ConLL data with accuracy
reaching 99 % but only reaches 65 % on the GHTorrent relations.
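
For reference, precision and recall can be computed from counts of true positives, false positives and false negatives as in the sketch below. The counts in the usage lines are made up for illustration and are not results from this study; the point is only that a precision below 0.5 means that more of the extracted relations are wrong than right.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Return (precision, recall); each is 0.0 when its denominator is zero."""
    extracted = true_pos + false_pos   # everything the system labeled as a relation
    actual = true_pos + false_neg      # everything annotated as a relation
    precision = true_pos / extracted if extracted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


# Hypothetical counts, not taken from the experiments.
precision, recall = precision_recall(true_pos=40, false_pos=60, false_neg=10)
print(precision, recall)  # 0.4 0.8
```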

Revisiting the introduction and looking at our motives for this study, we can recall
that the output is a candidate for use as decision support in industry and for gathering
data in research. The quantitative results from the experiments, coupled with what we have
seen throughout the rest of our discussion, clearly indicate that Relex in its
current form is not suited to the problem. Neither a researcher nor a decision maker in
a software development process can be expected to trust data that is likely to be misleading.

Thus, judging only by the results themselves, Relex would seem an unsuitable
match for relation extraction on Git commit messages. As for the relation extraction
and natural language processing fields in general, we have seen that models trained on data
drawn from well-formed English sentences do not perform satisfactorily. This seems
to be due to the domain-specific language used, including different syntax and different
vocabulary that cause the analysis to fail. In order to use the techniques and methods
offered by NLP, domain-specific models need to be trained for each individual component.
However, due to variations in syntax and terminology use between projects and
individuals, not even this is sure to produce a functional model. Other options include
resorting to lower degrees of supervision, which traditionally perform worse than their
fully supervised counterparts. We would say that in its current state and without specific
tailoring to the domain, relation extraction is not ready to be applied to the task
of extracting relations between entities in Git commit messages.


7 Threats to validity

7.1 Conclusion validity

The results in this work are produced using our own software. Following the pipeline
analogy, we use open source software to help set up the architecture and as boilerplate
code for the evaluations. Moreover, open source software is used for all of the
pre-processing pipes. For the pipes belonging to the relation extraction itself we have made
our own implementations. Faulty implementations of any of the components, our own or
open source, would erode the integrity of the results. For this reason we maintain complete
transparency, and all of the software used in this thesis can be found for inspection
in our software repository¹.

No significance tests are made on the differences between results of different tests made
in the study. Additionally, the number of entity relations is rather small. However, the
difference in performance between applying Relex to ConLL data and GHTorrent data
is big and the number of samples is large. Therefore, the qualitative conclusions made
in the thesis should be valid.

7.2 Internal validity

The entire study is carried out by the author alone, including the annotation of data, which
is very open to interpretation. The entire annotation project is available online² and two
lists of guidelines for annotation of entities and relations are provided as appendices A
and B respectively. The study would definitely be strengthened by more annotators and
an inter-judge reliability measurement for the annotations.

¹ https://github.com/Rikard-Andersson/relex
² https://github.com/Rikard-Andersson/GHTorrent-Brat

7.3 External validity

Only public git repositories from GitHub are included in the study, and this selection is
further limited to the 90 most popular projects. We can, however, generalize the
invalidation of the methods to the population based on a case: if a method does not work on
one case, it by definition does not work on the entire population.

The experiments are made with one system, which is a weakness and a threat to
external validity since results may not generalize to the entire field. This is mitigated
by Relex being a system that borrows its properties from a large number of other
systems in the field, which all share similar properties. We also provide an analysis
of results which is intended to extend to the general case. This analysis is, however,
qualitative. Any conclusions about the general case must be considered with this in mind.

7.4 Construct validity

If Relex is not a good representative of a relation extraction system, the
benchmarking of its performance between different domains may not be representative of
what we are trying to achieve. Not being able to compare Relex to any similar systems
is hence a threat to validity.


8 Conclusion

Relation extraction is used as a solution to structure unstructured data in the form of free
text. One source of such data is software repositories but the solution space of relation
extraction remains largely untested for this type of data. This thesis has therefore
investigated how well supervised relation extraction systems for natural language can be
applied to git commit messages.

In order to do this we have conducted an experiment where git commit messages
were extracted from public repositories on GitHub. These commit messages were then
annotated with entities and relations of interest and fed to a relation extraction system.
The results were evaluated and discussed both quantitatively and qualitatively.

The results from the experiments showed that the system had severe performance
issues, where the best case scenario involved a mere 25 % precision and 55 % recall. It was
shown that this is a steep decline from the system’s performance on well-formed English
sentences from newswire data. This decline can be attributed both to the models upon which
the relation extraction system relies and to the pre-processing models used.
These models do not function well on the noisy syntax and grammar of git commit
messages, which is further complicated by a vocabulary specific to the domain.

From this we conclude that supervised relation extraction out-of-the-box cannot be
used on git commit messages due to the lack of domain-specific models.

The lessons from this work should be interpreted in the light of the effort that was
made to achieve them. In this work existing models have been applied without much
alteration. As discussed in Section 1, the life sciences have an entire research field within
which the use of existing natural language processing technologies is explored and new
adaptations are made. This success could certainly be repeated for Software Engineering
and data such as git repositories if the effort is made to build domain-specific adaptations
of techniques, tools, and models for natural language processing.

Other options include exploring the use of different levels of supervision in the machine
learning algorithms. There is reason to believe that distantly supervised models
would work well, since the text data of git commit messages is closely coupled with a
structured source of data inherent in the git repositories themselves. Moreover,
semi-supervised and self-supervised algorithms are popular within open information
extraction, indicating such a direction as a possible improvement. Lastly, unsupervised
algorithms and deep learning with deep neural networks have received a lot of attention
lately, an area which could prove useful for the application discussed in this thesis.


Bibliography

[1] Eugene Agichtein and Luis Gravano. Snowball: Extracting relations from large
plain-text collections. In Proceedings of the Fifth ACM Conference on Digital Li-
braries, DL ’00, pages 85–94. ACM, 2000.

[2] Nasir Ali, Yann-Gael Gueneuc, and Giuliano Antoniol. Trustrace: Mining software
repositories to improve the accuracy of requirement traceability links. Software
Engineering, IEEE Transactions on, 39(5):725–741, 2013.

[3] Rikard Andersson. Thesis release. Aug 2014. doi: {10.5281/zenodo.11200}. URL
http://dx.doi.org/10.5281/zenodo.11200.

[4] Michele Banko and Oren Etzioni. The tradeoffs between open and traditional re-
lation extraction. In Proceedings of ACL-08: HLT, pages 28–36, Columbus, Ohio,
June 2008. Association for Computational Linguistics. URL http://www.aclweb.

org/anthology/P/P08/P08-1004.

[5] Michele Banko, Michael J. Cafarella, Stephen Soderland, Matt Broadhead, and
Oren Etzioni. Open information extraction from the web. In Proceedings of the
20th International Joint Conference on Artifical Intelligence, IJCAI’07, pages 2670–
2676, San Francisco, CA, USA, 2007. Morgan Kaufmann Publishers Inc. URL
http://dl.acm.org/citation.cfm?id=1625275.1625705.

[6] Sergey Brin. Extracting patterns and relations from the world wide web. In
The World Wide Web and Databases, volume 1590 of Lecture Notes in Com-
puter Science, pages 172–183. Springer Berlin Heidelberg, 1999. URL http:

//dx.doi.org/10.1007/10704656_11.

[7] Krishna Kumar Chaturvedi, V B Singh, and Prashast Singh. Tools in mining
software repositories. In Computational Science and Its Applications (ICCSA),
2013 13th International Conference on, pages 89–98, June 2013.

[8] Tse-Hsun Chen, Stephen W Thomas, Meiyappan Nagappan, and Ahmed E Hassan.
Explaining software defects using topic models. In Mining Software Repositories
(MSR), 2012 9th IEEE Working Conference on, pages 189–198, June 2012.

36

http://dx.doi.org/10.5281/zenodo.11200
http://www.aclweb.org/anthology/P/P08/P08-1004
http://www.aclweb.org/anthology/P/P08/P08-1004
http://dl.acm.org/citation.cfm?id=1625275.1625705
http://dx.doi.org/10.1007/10704656_11
http://dx.doi.org/10.1007/10704656_11


BIBLIOGRAPHY BIBLIOGRAPHY

[9] Janara Christensen, Mausam, Stephen Soderland, and Oren Etzioni. An analysis of
open information extraction based on semantic role labeling. In Proceedings of the
sixth international conference on Knowledge capture, pages 113–120. ACM, 2011.

[10] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning,
20(3):273–297, 1995.

[11] Aron Culotta and Jeffrey Sorensen. Dependency tree kernels for relation extrac-
tion. In Proceedings of the 42Nd Annual Meeting on Association for Computational
Linguistics, ACL ’04. Association for Computational Linguistics, 2004.

[12] Aron Culotta, Andrew McCallum, and Jonathan Betz. Integrating probabilistic
extraction models and data mining to discover relations and patterns in text. In
Proceedings of the Main Conference on Human Language Technology Conference
of the North American Chapter of the Association of Computational Linguistics,
HLT-NAACL ’06, pages 296–303. Association for Computational Linguistics, 2006.

[13] Oren Etzioni, Anthony Fader, Janara Christensen, Stephen Soderland, and Mausam
Mausam. Open information extraction: The second generation. In Proceedings of
the Twenty-Second international joint conference on Artificial Intelligence-Volume
Volume One, pages 3–10. AAAI Press, 2011.

[14] Peter Exner and Pierre Nugues. Entity extraction: From unstructured text to
dbpedia rdf triples. In Proceedings of the Web of Linked Entities Workshop in
conjuction with the 11th International Semantic Web Conference, pages 58–69, 2012.

[15] Georgios Gousios. The ghtorrent dataset and tool suite. In Proceedings of the 10th
Working Conference on Mining Software Repositories, MSR ’13, pages 233–236,
Piscataway, NJ, USA, 2013. IEEE Press. ISBN 978-1-4673-2936-1. URL http:

//dl.acm.org/citation.cfm?id=2487085.2487132.

[16] Ralph Grishman and Beth Sundheim. Message understanding conference-6: A brief
history. In Proceedings of the 16th Conference on Computational Linguistics - Vol-
ume 1, volume 96, pages 466–471, 1996.

[17] Maria Halkidi, Diomidis Spinellis, George Tsatsaronis, and Michalis Vazirgiannis.
Data mining in software engineering. Intelligent Data Analysis, 15(3):413–441, 2011.

[18] Ahmed E. Hassan and Tao Xie. Software intelligence: The future of mining software
engineering data. In Proceedings of the FSE/SDP Workshop on Future of Software
Engineering Research, FoSER ’10, pages 161–166. ACM, 2010.

[19] Chen Jinxiu, Ji Donghong, TanChew Lim, and Niu Zhengyu. Automatic relation
extraction with model order selection and discriminative label identification. In
Robert Dale, Kam-Fai Wong, Jian Su, and OiYee Kwong, editors, Natural Language
Processing – IJCNLP 2005, volume 3651 of Lecture Notes in Computer Science,
pages 390–401. Springer Berlin Heidelberg, 2005.

37

http://dl.acm.org/citation.cfm?id=2487085.2487132
http://dl.acm.org/citation.cfm?id=2487085.2487132


BIBLIOGRAPHY BIBLIOGRAPHY

[20] John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. Conditional
random fields: Probabilistic models for segmenting and labeling sequence data. In
Proceedings of the Eighteenth International Conference on Machine Learning, ICML
’01, pages 282–289, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers
Inc.

[21] Julie Medero, Kazuaki Maeda, Stephanie Strassel, and Christopher Walker. An
efficient approach to gold-standard annotation: Decision points for complex tasks.
In Proceedings of the Fifth International Conference on Language Resources and
Evaluation, volume 6, pages 2463–2466, May 2006.

[22] Yuval Merhav, Filipe Mesquita, Denilson Barbosa, Wai Gen Yee, and Ophir Frieder.
Extracting information networks from the blogosphere. ACM Transactions on the
Web (TWEB), 6(3):11, 2012.

[23] Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. Distant supervision for
relation extraction without labeled data. In Proceedings of the Joint Conference of
the 47th Annual Meeting of the ACL and the 4th International Joint Conference on
Natural Language Processing of the AFNLP: Volume 2-Volume 2, pages 1003–1011.
Association for Computational Linguistics, 2009.

[24] David Nadeau and Satoshi Sekine. A survey of named entity recognition and clas-
sification. Lingvisticae Investigationes, 30(1):3–26, 2007.

[25] Johan Natt och Dag, Björn Regnell, Vincenzo Gervasi, and Sjaak Brinkkemper. A
linguistic-engineering approach to large-scale requirements management. Software,
IEEE, 22(1):32–39, Jan 2005.

[26] Joakim Nivre, Johan Hall, and Jens Nilsson. Maltparser: A data-driven parser-
generator for dependency parsing. In Proceedings of the Fifth International Con-
ference on Language Resources and Evaluation, volume 6, pages 2216–2219, May
2006.

[27] Martha Palmer, Daniel Gildea, and Paul Kingsbury. The proposition bank: An
annotated corpus of semantic roles. Computational linguistics, 31(1):71–106, 2005.

[28] Daniel Pletea, Bogdan Vasilescu, and Alexander Serebrenik. Security and emotion:
Sentiment analysis of security discussions on github. In Proceedings of the 11th
Working Conference on Mining Software Repositories, MSR 2014, pages 348–351.
ACM, 2014.

[29] Sameer Pradhan, Lance Ramshaw, Mitchell Marcus, Martha Palmer, Ralph
Weischedel, and Nianwen Xue. Conll-2011 shared task: Modeling unrestricted
coreference in ontonotes. In Proceedings of the Fifteenth Conference on Compu-
tational Natural Language Learning: Shared Task, CONLL Shared Task ’11, pages
1–27, Stroudsburg, PA, USA, June 2011. Association for Computational Linguistics.
ISBN 9781937284084.

38


BIBLIOGRAPHY BIBLIOGRAPHY

[30] Benjamin Rosenfeld and Ronen Feldman. Clustering for unsupervised relation iden-
tification. In Proceedings of the sixteenth ACM conference on Conference on infor-
mation and knowledge management, pages 411–418. ACM, 2007.

[31] Per Runeson, Magnus Alexandersson, and Oskar Nyholm. Detection of duplicate
defect reports using natural language processing. In Software Engineering, 2007.
ICSE 2007. 29th International Conference on, pages 499–510, May 2007.

[32] Michael Schmitz, Robert Bart, Stephen Soderland, Oren Etzioni, et al. Open lan-
guage learning for information extraction. In Proceedings of the 2012 Joint Con-
ference on Empirical Methods in Natural Language Processing and Computational
Natural Language Learning, pages 523–534. Association for Computational Linguis-
tics, 2012.

[33] Pontus Stenetorp, Sampo Pyysalo, Goran Topić, Tomoko Ohta, Sophia Ananiadou,
and Jun’ichi Tsujii. Brat: A web-based tool for nlp-assisted text annotation. In
Proceedings of the Demonstrations at the 13th Conference of the European Chap-
ter of the Association for Computational Linguistics, EACL ’12, pages 102–107,
Stroudsburg, PA, USA, 2012. Association for Computational Linguistics.

[34] Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati, and Christopher D. Manning.
Multi-instance multi-label learning for relation extraction. In Proceedings of the
2012 Joint Conference on Empirical Methods in Natural Language Processing and
Computational Natural Language Learning, EMNLP-CoNLL ’12, pages 455–465,
Stroudsburg, PA, USA, 2012. Association for Computational Linguistics.

[35] Ferdian Thung, Tegawende F. Bissyande, David Lo, and Jiang Lingxiao. Network
structure of social coding in github. In Software Maintenance and Reengineering
(CSMR), 2013 17th European Conference on, pages 323–326, March 2013.

[36] Jason T. Tsay, Laura Dabbish, and James Herbsleb. Social media and success in
open source projects. In Proceedings of the ACM 2012 Conference on Computer
Supported Cooperative Work Companion, CSCW ’12, pages 223–226, New York,
NY, USA, 2012. ACM.

[37] Xiaoyin Wang, Lu Zhang, Tao Xie, John Anvik, and Jiasu Sun. An approach to
detecting duplicate bug reports using natural language and execution information.
In Proceedings of the 30th International Conference on Software Engineering, ICSE
’08, pages 461–470, New York, NY, USA, 2008. ACM.

[38] Fei Wu and Daniel S Weld. Open information extraction using Wikipedia. In
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics,
pages 118–127. Association for Computational Linguistics, 2010.

[39] Leon Li Wu, Boyi Xie, Gail E Kaiser, and Rebecca Passonneau. Bugminer: Software
reliability analysis via data mining of bug reports. In Proceedings of the 23rd
International Conference on Software Engineering and Knowledge Engineering (SEKE),
pages 95–100. Department of Computer Science, Columbia University, July 2011.

[40] Ying Xu, Mi-Young Kim, Kevin Quinn, Randy Goebel, and Denilson Barbosa. Open
information extraction with tree kernels. In Proceedings of the 2013 Conference
of the North American Chapter of the Association for Computational Linguistics:
Human Language Technologies, pages 868–877, June 2013.

[41] Limin Yao, Aria Haghighi, Sebastian Riedel, and Andrew McCallum. Structured
relation discovery using generative models. In Proceedings of the Conference on
Empirical Methods in Natural Language Processing, pages 1456–1466. Association
for Computational Linguistics, 2011.

A
List of entity types

Files A file in a file system

Directories Any directory in a file system

Classes Referring to the object oriented programming concept of a class

Object An instantiation of a class

Methods Referring to methods and functions in programming

Variables Referring to the concept in programming

Parameters Very similar to variables, but typically used for values provided
in method calls

URL A URL

Resources A generic type to collect other types of resources used in development or in
code

Issues An issue, a bug report, a bullet point on a todo list

Commits The git concept of a commit

Branches The git concept of a branch

Pull request A type of issue on GitHub

Products Product names such as iOS, iPhone, Android, MySQL

Technical concept/Tech Misc. A catch-all type for technical concepts which do not
fit under any other type

Version A version name of a product

Tests A test case
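
When the entity types above are used in annotation tooling, it can help to encode them as a fixed vocabulary. The following is a minimal sketch only: the labels are taken from the list in this appendix, while the class name EntityType, its member names, and the helper entity_type_from_label are hypothetical and not part of the thesis implementation.

```python
from enum import Enum


class EntityType(Enum):
    """Entity types from Appendix A. Member names are illustrative only."""
    FILE = "Files"
    DIRECTORY = "Directories"
    CLASS = "Classes"
    OBJECT = "Object"
    METHOD = "Methods"
    VARIABLE = "Variables"
    PARAMETER = "Parameters"
    URL = "URL"
    RESOURCE = "Resources"
    ISSUE = "Issues"
    COMMIT = "Commits"
    BRANCH = "Branches"
    PULL_REQUEST = "Pull request"
    PRODUCT = "Products"
    TECH_MISC = "Technical concept/Tech Misc."
    VERSION = "Version"
    TEST = "Tests"


def entity_type_from_label(label: str) -> EntityType:
    """Look up an entity type by its label, ignoring case and outer whitespace."""
    wanted = label.strip().lower()
    for entity_type in EntityType:
        if entity_type.value.lower() == wanted:
            return entity_type
    raise ValueError(f"unknown entity type label: {label!r}")
```

A closed enumeration makes a misspelled label fail loudly at load time instead of silently creating a new entity type.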

B
Annotation guidelines

1. Relations have an explicitly stated relation phrase that occurs between the two
entities.

2. The relation must occur directly between two entities. If there is a mediating noun
between them, leave the relation out.

(a) <E1> has done something to something that is related to <E2>

(b) <E1> has done something to something because of something in <E2>

(c) “added support for iPhone” - Not a relation, since the author relation is not
directly between iPhone and the author but rather between the author and
support.

(d) “Dropped support for PostgreSQL < 8.4” - Same as above

3. Author relation

(a) Instance where the author (not mentioned in the commit message) has done
something to an entity (mentioned in the commit message).

(b) Instance where a commit or the application of a commit (not mentioned in
the commit message) has done something to an entity (mentioned in the
commit message).

4. If something is ambiguous (you are not sure about the actual relation), leave it out

5. Ignore chunks of code
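
Guideline 1 can be checked mechanically once annotations are stored as token spans. The sketch below is an illustration under stated assumptions: spans are half-open token index ranges, the names Span and phrase_is_between are hypothetical, and the check does not cover author relations (guideline 3), where one argument is not mentioned in the message at all.

```python
from typing import NamedTuple


class Span(NamedTuple):
    start: int  # index of the first token in the span
    end: int    # index one past the last token in the span


def phrase_is_between(e1: Span, phrase: Span, e2: Span) -> bool:
    """Return True if the relation phrase lies between the two entities.

    The entities may appear in either order in the commit message.
    """
    first, second = sorted((e1, e2))
    return first.end <= phrase.start and phrase.end <= second.start


# Hypothetical three-token message: <entity> <relation phrase> <entity>
e1, phrase, e2 = Span(0, 1), Span(1, 2), Span(2, 3)
ok = phrase_is_between(e1, phrase, e2)
```

A check like this only validates the position of the relation phrase. Whether a mediating noun is present (guideline 2) still requires human judgment.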

C
CoNLL 2011 shared task data example

Table C.1 shows an excerpt from the tables provided in the CoNLL 2011 shared task as
training and testing data. The excerpt is incomplete: the original tables provide
additional data about each sentence, which is left out here for simplicity. Each table
can also contain further columns for PropBank-style relations, whereas the example in
Table C.1 has only two.
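
As a minimal sketch of how rows in the simplified layout of Table C.1 could be read, the code below splits each whitespace-separated line into a token, a part-of-speech tag, a named entity tag, and a variable number of relation columns. It assumes only the four column groups shown in the table, not the full shared task file format, and the names Row and parse_row are hypothetical.

```python
from typing import List, NamedTuple


class Row(NamedTuple):
    token: str
    pos: str
    named_entity: str
    relations: List[str]  # one entry per relation column


def parse_row(line: str) -> Row:
    """Split one whitespace-separated line into the Table C.1 columns."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 columns, got {len(fields)}")
    token, pos, named_entity = fields[:3]
    # Everything after the third column is treated as a relation column.
    return Row(token, pos, named_entity, fields[3:])


# The first two rows of Table C.1
rows = [parse_row(line) for line in (
    "Rear NNP * (ARG0* (ARG1*",
    "Admiral NNP * * *",
)]
```

Keeping the relation columns in a list accommodates tables with more than the two relation columns shown in the example.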

Token Part-of-speech Named entity Relations

Rear NNP * (ARG0* (ARG1*

Admiral NNP * * *

Mark NNP (PERSO