Statistical parsing and unambiguous word representation in OpenCog's Unsupervised Language Learning project
Master's thesis in MPALG and MPCAS
CLAUDIA CASTILLO DOMENECH and ANDRES SUAREZ MADRIGAL
Department of Computer Science and Engineering
CHALMERS UNIVERSITY OF TECHNOLOGY
UNIVERSITY OF GOTHENBURG
Gothenburg, Sweden 2018

Master's thesis 2018

Statistical parsing and unambiguous word representation in OpenCog's Unsupervised Language Learning project
CLAUDIA CASTILLO DOMENECH
ANDRES SUAREZ MADRIGAL
Department of Computer Science and Engineering
Chalmers University of Technology
University of Gothenburg
Gothenburg, Sweden 2018

Statistical parsing and unambiguous word representation in OpenCog's Unsupervised Language Learning project
CLAUDIA CASTILLO DOMENECH
ANDRES SUAREZ MADRIGAL

© CLAUDIA CASTILLO DOMENECH and ANDRES SUAREZ MADRIGAL, 2018.

Supervisor: Luis Nieto Piña, Faculty of Humanities
Advisor: Ben Goertzel, Hanson Robotics Ltd.
Examiner: Morteza Haghir Chehreghani, Data Science division

Master's Thesis 2018
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Typeset in LaTeX
Gothenburg, Sweden 2018

Statistical parsing and unambiguous word representation in OpenCog's Unsupervised Language Learning project
CLAUDIA CASTILLO DOMENECH and ANDRES SUAREZ MADRIGAL
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg

Abstract

The work presented in this thesis is an effort within the larger project entitled "Unsupervised Language Learning", which aims to build a system that can learn the grammar of a language by processing a corpus of unannotated text. In particular, the authors focused on the first steps of the pipeline, including the selection and pre-processing of useful text collections to train the system and test its performance, as well as the syntactic learning loop in which the system obtains statistics from sentences in a given corpus and leverages them to implicitly learn the syntactic structure of the corpus language. The process gathers statistics from the co-occurrence of words in a sentence using different counting methods and estimates the mutual information between word pairs. Using this knowledge and a minimum spanning tree algorithm, the system automatically produces syntactic parses of the ingested sentences. The unsupervised parses, when compared to human-generated or rule-based standards, show varying quality; however, they always perform better than a baseline of randomly generated parses, implying that the system is indeed learning some of the assumed underlying linguistic content from the texts. From the results we learned that the complexity of the input text is a determining factor for which method performs best, leading us to conclude that a successful unsupervised parser should be able, to some extent, to pre-assess this complexity before processing. Also, the outputs of our different parser methods all show that accounting for the distance between the words of a pair when parsing yields better results. Nonetheless, to evaluate this implication more confidently, it is important to have a comparison standard based on the same model assumptions.
Additionally, we implemented a disambiguation process based on AdaGram as a way to build distinct representations for different word senses within a corpus, which then annotates the corpus with tags representing different uses of each word. The purpose of this pre-processing step is to break polysemy in the corpora and provide a cleaner input to the parsing pipeline. We report that our experiments show either a slight improvement in parse quality, or no significant change, if disambiguated corpora are used as input.

Keywords: Unsupervised, language learning, parsing, disambiguation, computer science, engineering, project, thesis.

Acknowledgements

First of all, we want to thank the teams at Hanson Robotics and OpenCog for welcoming us in Hong Kong to contribute to their visionary goals and for sharing their knowledge, time, and friendship with us. In particular, we are grateful to Ben Goertzel for his insights and for making the collaboration between the university and the company viable, and to Anton Kolonin for his constant organization of our efforts and for always being ready to propose a new experiment.

Second, we want to recognize the contribution of our supervisor, Luis Nieto Piña, in making this thesis possible. We greatly benefited from his timely and pertinent advice during the different stages of the project, as well as from his meticulous revisions of our manuscript, even during the rushed last days before submission. In all: ¡Gracias, Luis!

On the personal side, we have also received invaluable aid that we want to acknowledge:

Andrés: I am thankful for the unending support of my parents, brother and sister-in-law. I am also grateful to my friend Iuliia, who was always present during the good and less-good times of the past year.

Claudia: I want to thank my family for their unceasing support during my studies. I am especially grateful to my aunt María Sol, my brother Carlos and his wife Andrea, for not only hosting me in their homes, but also taking care of me, encouraging me and being my strength in the most critical moments.

Gothenburg, October 2018

Contents

List of Figures xi
List of Tables xiii
1 Introduction 1
1.1 The Unsupervised Language Learning Project . . . . . . . . . . . . . 2
1.1.1 Unsupervised Syntactic Analysis . . . . . . . . . . . . . . . . . 3
1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Scope and Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2 Related Work 9
2.1 Language models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.1 Lexical attraction . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1.1.1 Entropy . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1.1.2 Mutual Information . . . . . . . . . . . . . . . . . . 11
2.1.2 Maximal parsing algorithm . . . . . . . . . . . . . . . . . . . . 13
2.2 AdaGram: learning disambiguated word representations . . . . . . . . 13
2.2.1 Word embeddings in Skip-Gram . . . . . . . . . . . . . . . . . 15
2.2.2 Multi-prototype AdaGram . . . . . . . . . . . . . . . . . . . . 16
2.3 OpenCog: an open source platform for AGI . . . . . . . . . . . . . . 17
2.3.1 Representation structures . . . . . . . . . . . . . . . . . . . . 19
2.3.1.1 Sentence representation . . . . . . . . . . . . . . . . 20
3 Methods 25
3.1 Controlled corpora . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.1.1 Pre-cleaner . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2 Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . 29 3.2.1 Frequency counting . . . . . . . . . . . . . . . . . . . . . . . . 29 3.2.2 Mutual information calculation . . . . . . . . . . . . . . . . . 31 3.2.3 MST algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 33 3.3 Word-sense Disambiguation . . . . . . . . . . . . . . . . . . . . . . . 35 3.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.4.1 Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.4.2 Word-sense disambiguation . . . . . . . . . . . . . . . . . . . 41 4 Results 45 4.1 Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 ix Contents 4.1.1 POC-Turtle . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 4.1.2 POC-English without ambiguity . . . . . . . . . . . . . . . . . 49 4.1.3 POC-English with ambiguity . . . . . . . . . . . . . . . . . . 52 4.1.4 Child Directed Speech . . . . . . . . . . . . . . . . . . . . . . 54 4.1.5 Gutenberg Children . . . . . . . . . . . . . . . . . . . . . . . . 58 4.1.6 Brown . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 4.1.7 Penn Treebank . . . . . . . . . . . . . . . . . . . . . . . . . . 63 4.2 Word-sense Disambiguation . . . . . . . . . . . . . . . . . . . . . . . 66 4.2.1 WSD in POC-English . . . . . . . . . . . . . . . . . . . . . . 66 4.2.2 WSD in Gutenberg Children . . . . . . . . . . . . . . . . . . . 68 5 Conclusions 75 5.1 Analysis of results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 5.2 Recommendations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 Bibliography 79 x List of Figures 1.1 Language Learning Project pipeline. . . . . . . . . . . . . . . . . . . . 2 1.2 Syntactic Learning Loop. . . . . . . . . . . . . . . . . . . . . . . . . . 4 2.1 Architecture of the Skip-Grammodel. Image credit: Leonardo Barazza [1]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.2 Training sample selection in the Skip-Gram model. Image credit: Leonardo Barazza [1]. . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 3.1 An example of a random parsing. . . . . . . . . . . . . . . . . . . . . 39 3.2 An example of a sequential parsing. . . . . . . . . . . . . . . . . . . . 39 3.3 F1 score metric: graphic interpretation of precision and recall. Image credit: Walber [51]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 4.1 Example of an incorrect parse. Corpus: POC-Turtle. Method: OpenCog (all 6) against GS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 4.2 Example of an incorrect parse without distance accounting in scoring function. Corpus: POC-Turtle. Method: OpenCog (all 3) with FMI against GS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 4.3 Example of an incorrect parse without distance accounting in the counting method. Corpus: POC-Turtle. Method: OpenCog Clique- WIN (with both scoring functions) against GS. . . . . . . . . . . . . 49 4.4 Example of an incorrect parse in standard. Corpus: POC-English- NoAmb. Method: LG-English (SS) against GS. . . . . . . . . . . . . 50 4.5 Example of a parse with incorrect linking of articles. Corpus: POC- English-NoAmb. Method: OpenCog Clique-WIN + FMI against GS. 51 4.6 Example of a parse with incorrect linking of negation particles and time adverbs. Corpus: POC-English-NoAmb. Method: OpenCog Clique-WIN + FMI against GS. . . . . . . . . . . . . . . . . . . . . . 52 4.7 Example of a parse with compound verbs. Corpus: POC-English- Amb. 
Method: LG-English (SS) against GS. . . . . . . . . . . . . . . 53 4.8 Example of a parse with syntactic ambiguity. Corpus: POC-English- Amb. Method: LG-English (SS) against GS. . . . . . . . . . . . . . . 53 4.9 Example of an incorrect parse. Corpus: POC-English-Amb. Method: OpenCog Clique-WIN + FMI against GS. . . . . . . . . . . . . . . . 54 4.10 Example of a parse excluding words and separating verbs with apos- trophes. Corpus: CDS. Method: OpenCog (all 6) against LG-English (SS). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 xi List of Figures 4.11 Example of a parse that doesn’t process apostrophes. Corpus: CDS. Method: OpenCog Clique-WIN-dist + FMI-dist against LG-English (SS). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 4.12 Example of a parse with different tokenization. Corpus: Guten- berg Children. Method: OpenCog LG-ANY + FMI-dist against LG- English (SS). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 4.13 Example of a parse with inline quotations. Corpus: Gutenberg Chil- dren. Method: OpenCog Clique-WIN-dist + FMI-dist against LG- English (SS). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 4.14 Example of a parse with a compound sentence. Corpus: Brown. Method: OpenCog Clique-WIN-dist (2) and LG-Any (3) against SS (1). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 4.15 Example of a simple parse. Corpus: Penn Treebank. Method: LG- Any with FMI-dist (3) against GS (1) and SS (2). . . . . . . . . . . . 65 4.16 Example of a complex parse. Corpus: Penn Treebank. Method: LG- Any with FMI-dist (3) against GS (1) and SS (2). . . . . . . . . . . . 65 4.17 Examples and results for Model 1 and 2, trained for WSD in the POC-English corpus. Ambiguous word senses, according to the GS, are colored in the example sentences. . . . . . . . . . . . . . . . . . . 68 4.18 Linear effects for selected hyper-parameters on PWFS score for the Gutenberg Children corpus. . . . . . . . . . . . . . . . . . . . . . . . 70 xii List of Tables 4.1 Parsing results for POC-Turtle. . . . . . . . . . . . . . . . . . . . . . 46 4.2 Example of pair FMI calculation in POC-Turtle: word wing. . . . . . 47 4.3 Example of pair FMI calculation in POC-Turtle: word eagle. . . . . . 48 4.4 Example of pair FMI calculation in POC-Turtle: word herring. . . . . 49 4.5 Parsing results for POC-English without ambiguity. . . . . . . . . . . 50 4.6 Example of pair FMI calculation in POC-Turtle: verb + article. . . . 51 4.7 Parsing results for POC-English with ambiguity. . . . . . . . . . . . . 53 4.8 Parsing results for Child Directed Speech. . . . . . . . . . . . . . . . 55 4.9 Parsing results for Gutenberg Children. . . . . . . . . . . . . . . . . . 58 4.10 Parsing results for Brown. . . . . . . . . . . . . . . . . . . . . . . . . 63 4.11 Parsing results for Penn Treebank. . . . . . . . . . . . . . . . . . . . 64 4.12 Hyper-parameter effects on WSD scores for EnglishPOC. . . . . . . . 67 4.13 Parameters and scores for the best disambiguated POC-English models. 68 4.14 Parsing results for POC-English after disambiguation. . . . . . . . . . 69 4.15 Parameters and scores for the best disambiguated Gutenberg Chil- dren model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 4.16 Spearman’s ρ correlation for different similarity scores for our best disambiguated Gutenberg Children model. Results for all models but Model 1 are taken from Huang et al. [27] . . . . . . . . . . . . . . . . 
71
4.17 Parsing results for Gutenberg Children after disambiguation . . . . . . 73

1 Introduction

Language is perhaps the single most important feat that humankind has achieved in history. Unparalleled by any other species on Earth, early human verbal communication offered a more efficient way to transmit ideas and abstract thoughts, and to share experiences with others, thus accelerating learning and the progress of societies throughout the world. Current-day machines have their own internal languages, also invented by humans, and they are quite efficient with them. However, those languages differ considerably from natural languages (languages created by humans through evolution, without intentional design), to the point that only a small percentage of the population is fluent in them.

Natural Language Processing (NLP) is a multidisciplinary area of research that strives to create systems that are able to communicate externally through the use of human languages [29]. Using human language to express instructions to a computer and obtain its results would greatly enhance its ease of use. This is a non-trivial task, and although we already have decent speech recognition, sequential translation, and question-answering systems (just to mention a few applications), we still cannot make programs that fully comprehend language. Some even consider it an AI-complete problem: a problem that requires self-conscious systems, or 'true AI', to be solved [54].

Traditionally, approaches to tackle Natural Language Processing have depended on a human-in-the-loop, either to write sets of explicit rules that allow a system to disentangle the different parts of a language input, or to annotate text corpora for the system to learn from. The complexity of human language makes a set of rules or manual guidelines unlikely to cover all possibilities the language can offer, let alone when the semantic layer is to be taken into account, which seems necessary for a system to completely understand a language. Moreover, learning from text corpora in a supervised manner requires lots of data to be pre-processed and labeled before training, and unfortunately most of the annotation efforts have been centered on English (and a few other widely spoken languages), which means that these methods will not be as effective in most other languages. Therefore, an unsupervised learning system that could derive meaning and learn any language's grammar (the structural rules present in that language) from unannotated corpora appears to be an adequate approach to solving the problem.

1.1 The Unsupervised Language Learning Project

The purpose of the Unsupervised Language Learning (ULL) Project1, developed by researchers at Hanson Robotics2 and the OpenCog Project3, is to implement the NLP pipeline proposed by Vepstas and Goertzel [49], aiming to achieve the understanding of any language by training on unannotated text corpora. When completed, the process is expected to learn a language grammar in an unsupervised way, a crucial tool needed for an NLP system to understand that language. The pipeline is based on a built-in linguistic model framework that assumes the existence of grammatical dependency, as well as syntactic and semantic structure, in the language. The different layers of the model exploit and learn such components of the language. Figure 1.1 shows a high-level diagram of the described learning process.

Figure 1.1: Language Learning Project pipeline.
In the diagram we can see how the units of the proposed pipeline interact in a recursive manner to learn the structures of each of the three language components, and finally obtain logic induction from them, all starting from raw text. In particular, the syntactic analysis processes receive feedback after the semantic analysis is performed. At the same time, learning the context of these structures can provide some tools for refined abstractions. The objective is for these components to be able to achieve a higher generalization of the structures present in the language. As the authors of the proposed pipeline state, "this can be viewed as a form of 'deep learning' at work: a higher, more abstract layer can serve to refine the correctness of a shallower, more concrete layer" [49].

After enough iterations, these three units of the language learning process should be capable of deriving "a system of relationships that display appropriate coherence and that, when applied by an appropriate parser to the inputs from the previous layer, will yield parse trees that reflect the information content of the input" [49]. By 'enough iterations' we mean enough to reach a level of abstraction that allows comparison and interaction with a model of the physical world, so that the language model serves to represent it. Certainly, training all of this in an unsupervised manner requires a large corpus of text, but once trained, the system will be able to understand previously unseen text by using the learned language structures and applying logic induction.

1 https://wiki.opencog.org/w/Language_learning
2 www.hansonrobotics.com
3 www.opencog.org

1.1.1 Unsupervised Syntactic Analysis

As one can see in the diagram (Figure 1.1), the first component of the learning process is the syntactic analysis. Although some implicit semantic analysis will be performed as well, the main focus of this thesis is centered on this first component; to highlight the difference, we have used a different color for it than for the other two components in the diagram. The next section discusses the topic of this dissertation in detail, but before getting into the specifics, we provide a short overview of the original authors' ideas about how to handle the extraction of syntax.

In order to understand the particulars of the learning process, it is necessary to first understand the authors' general approach behind all of the component learning loops. The high-level algorithm of the pipeline can be described by the following steps:

A. Define words to be 'objects'.
B. Look for correlations between objects.
C. Express the correlations in a convenient representation.
D. Cluster similar objects together into classes.
E. Define a new set of objects as the clusters obtained from the last step.
F. Repeat from B until a certain level of abstraction is obtained.
G. Use the definitions to apply logical induction on unseen data.

Observe that the 'feedback' to bootstrap the language components depends heavily on steps B, C, and D. In the context of syntactic analysis, finding correlations (step B) is exactly what a parser does. Thus, as the first steps in the ULL pipeline, the system should be able to observe sentences from an unannotated corpus and statistically induce a grammar from them, then obtain dependency parses with untagged labels between their components.
These parses can subsequently be fed as input to the following learning layers, or be processed further to provide feedback to the syntactic component itself. In this way, recursion can be done at different levels, by either using the feedback from the same component, or by taking in refinements from other components when redefining the 'new objects' accordingly.

Figure 1.2 shows how this recursion could be done in detail for the syntactic component. In a first pass of the currently proposed approach, raw input text is tokenized and separated into words, which will be our first 'objects'. Then, Mutual Information (MI) measurements are used to decide how likely it is for words to appear together in a sentence (as an induction of a rough initial grammar). Finally, a Minimum Spanning Tree (MST) algorithm uses this rough grammar based on word-pair MI to build a dependency parse tree for a sentence. In later passes of the algorithm, these same settings can be used to process the newly defined 'objects'4. Alternatively, we could use a grammar derived from our own method in further steps and parse the input in a dictionary-based manner.

4 For example, the new objects can be word senses after disambiguation; or they can be words tagged with the categories obtained from a previous pass through the layer of semantic analysis.

Figure 1.2: Syntactic Learning Loop.

Once the first steps are able to produce acceptable parses from raw text in an unsupervised fashion, they will provide a database of dependent word pairs representing an enormous amount of highly fine-grained information regarding the syntactic rules of the language. The next natural step is organizing these relations to derive an improved structure of the learned language, and producing from it a proper grammar that could be comparable to hand-crafted rules like those used by, e.g., the Link Grammar [46]5. This is equivalent to performing step D of the general algorithm; but in order to do that, we first need to pre-process these parses or correlations in an appropriate way to allow for further organization and clustering. In other words, we need to perform step C first.

5 Link Grammar (LG) is a theory of syntax and morphology originally proposed by Davy Temperley and Daniel Sleator (1991), and contributed to by several developers. It parses a sentence by directly connecting pairs of words instead of generating a phrase structure hierarchy like constituency grammars do. It differs from a dependency grammar in that links between words lack a head-dependent relationship, i.e. they are undirected. Further details as well as the parser code are available at the main website: https://www.abisource.com/projects/link-grammar/

A proper representation of our objects is therefore needed. Taking into consideration the information contained in the parses and the way most existing clustering methods work, a vectorization of our objects seems appropriate. This is because having objects (words, word pairs or any other linguistic unit, in a first pass of the pipeline) represented in a geometric space makes future comparisons and calculations simpler than having to deal with categorical objects. A possible approach is to use dimensionality reduction based on context estimation: utilize a word embedding process that learns to represent concepts, in this case our objects, as vectors in a continuous multidimensional space6.
This approach is convenient because most natural languages have on average more than 100,000 words in their vocabulary, and dealing with such high dimensionalities would be complex and time-consuming, not to mention computationally expensive. Thus, using these smaller object vectors, it is possible to proceed with the next step: clustering of words into groups that resemble lexical categories.

6 It is expected that this type of object representation will implicitly learn some meaning related to the objects. This is the only part of the syntactic loop that deals with semantics without receiving feedback from the semantic loop.

The idea with clustering is to reduce the amount of highly fine-grained syntactic information obtained from the parser by extracting common patterns from it that can be turned into a grammar. A suitable approach in early iterations would be to reduce the number of rules obtained from the parser by grouping together words that behave in a similar way, i.e. that frequently appear linked to the same word(s). The purpose of this categorization of words is to learn a grouping that could coarsely correspond to part-of-speech (POS) classes. In later iterations, when dealing with higher-level abstractions, the same grouping strategy could learn semantic relations that could again be fed back to the parsing and categorization system, and perhaps learn idioms and set phrases. Finally, the grammar rules obtained after clustering can be used to re-process the raw text, this time using a rule-based parser, or they can be fed as input to the following analysis component loops.

1.2 Contributions

The current thesis project strives to add to the first steps of this large NLP research effort by, as mentioned before, concentrating on the syntactic loop only. So far, we have presented a short summary of the proposal as it was originally stated by its authors [49]7. It is important to note that the different pipeline components presented so far follow thoroughly-studied hypotheses that have been previously presented and partially tested by other authors, but had not been connected together to form a complete pipeline as attempted by the current project. Certainly, much has changed since the date of publication of the authors' first article, at the academic level and in relation to certain parts of the pipeline; and now we can use these advances to contribute to its development even further. Here we would like to broadly describe the actual implementations that our thesis project contributed; but before we can do that, we need to provide a quick description of the development conditions of the general project before our participation.

7 For an alternative description of the project, check out the source code's Readme file at https://github.com/opencog/opencog/tree/master/opencog/nlp/learn

When we joined the OpenCog project to work on the current thesis, there was already an alpha version of the parser with some experimental results, but it was not suitable for our purposes8. First, there was a description (Readme file and Wiki page) of the general project, but no thorough documentation for each function and script existed, so every new person interested in collaborating with the project had to read through the entire source code to understand what each component did.
Second, the installation process of the pipeline on a new computer was complex, and there were some mistakes in the output of the parser in random runs that appeared to be related to this installation process. Third, the results available from experiments conducted with this alpha version of the parser were not properly organized, and it was unclear which specific corpora had been used as input in each case. Also, at least some of the results were obtained using Wikipedia articles as the input text, which did not provide good results: the factual, informative nature of these articles brings little variety in pronouns or action verbs, and they contain large numbers of proper names, foreign-language words, dates, tables, etc., which hinder the learning process and can lead to unusual deductions of grammar.

Given the development state of the project, it seemed like a logical way to contribute to start by gathering useful corpora, implementing a pre-processor of data, revising and amending the state of the alpha parser, and possibly working with word-sense disambiguation before parsing as part of the syntactic analysis loop. With this in mind, we identified the need for five specific tasks to be developed:

1. Find or create adequate text corpora to test the efforts of the project.
2. Create a configurable pre-processor to simplify these corpora as needed.
3. Understand, modify and test the existing OpenCog-based parser to make it useful to the project.
4. Integrate word disambiguation and representation learning algorithms (as part of pre-processing).
5. Write proper documentation for the three previous tasks.

In particular, the question we aim to answer in this thesis project, by performing the described tasks, is whether it is possible to learn the syntactic structures of a language in an unsupervised manner and get results that are at least as good as those of existing supervised methods. We will try to achieve this by assuming that each language has an innate dependency structure between word pairs, and we aim to learn this structure from observing a large amount of text. As part of the ULL project, our ideas will be based on the authors' original proposal, but we will use our knowledge of the latest developments in the field to try to advance even further. Technically speaking, the goals of this thesis project are to improve, combine and modify existing algorithms (which may be used for other purposes) to work as a part of the proposed pipeline. Based on the results, some extensions to this project are proposed in the Conclusions chapter. Finally, one challenge we aim for is making these modifications using graphs and tree-like structures in such a way that the resulting algorithm can be applied to numerous other fields outside NLP.

8 One of the project contributors experimented with the other two parts of the syntactic loop as well, but none of his results at that moment looked like something we could build upon.

1.3 Scope and Limitations

This thesis project deals with a problem that is part of a bigger and rather ambitious research effort: solving unsupervised language understanding. If such a research effort succeeded, the conceptual contributions to the scientific community would be immensely valuable. Nowadays, developing an unsupervised language parser for any language would already be a big milestone.
We considered that solving the problem of obtaining simple grammar and syntax structures, and finding a proper representation for them to improve a later grouping into lexical categories, is a key piece necessary before experimenting with recursion. Achieving favorable results on this topic would be equivalent to overcoming the next obstacle, allowing for further development in more abstract layers of language understanding. Thus, this is the current focus of the OpenCog team: experimenting with one forward pass through the syntactic loop only.

Moreover, as part of the OpenCog team research project, the results of this thesis will be limited to one pass of the pipeline as well. It is important to keep in mind that, according to the authors' proposal, refinements in learning are obtained through recursion, so without recursion the outcome of a first pass through the pipeline is expected to be of limited performance. For this reason, we will also limit our evaluation standards, with the hope that once recursion is implemented in the future, our method will manage to surpass not only the state of the art of unsupervised methods but supervised methods as well. It is also relevant to mention that the scope of our thesis does not include the embedding or clustering sections9. Therefore, we found experimentation with word-sense disambiguation before parsing to be a convenient way of evaluating how the parsing outputs can vary if the input to the pipeline (our 'objects') gets refined.

9 Other members of the OpenCog team are working on these sections in parallel with us. Unfortunately, because of time limitations, it will not be possible to publish their results or use them for comparison within this thesis.

Finally, because we will not be able to harness the power of recursion during a single pass of the learning algorithm, we decided to simplify the input data through a pre-processing step, in order to better understand whether the implemented ideas work in a simple case, thus assessing their performance. Because of time restrictions, we decided to focus on texts that have different complexity levels but that belong to the same language. Since the most extensive supervised NLP efforts have been carried out in English, we restricted our experiments to English corpora as well, so that we have a benchmark for comparison10.

10 One of our collaborators has tested the pipeline with Chinese, but his scripts require the text corpora to first be processed by another algorithm that learns to split the text into words before being input to the pipeline, and this effort is outside of our thesis domain.

Following the previous explanations, the structure of this thesis is as follows:

• Chapter 2 introduces the related work relevant to our topic. It specifically explains the two main theories that our thesis builds on, which are Yuret's unsupervised parser [55] and AdaGram word embeddings [2]. It also gives an overview of the OpenCog project.
• Chapter 3 presents the ways we will implement the ideas taken from the related work in the OpenCog environment and what tools we will use to evaluate our results. This chapter also offers a description of the text corpora we selected.
• Chapter 4 presents the results obtained from our experiments, mainly comparison charts and useful output examples.
• Chapter 5 summarizes our findings, gives recommendations of what could be improved, and suggests ways in which this project could be continued in the future.
2 Related Work

The amount of research involving methods for natural language processing systems is extensive. Nonetheless, as mentioned before, most of the existing approaches depend heavily on the input of human experts in the field. Klein and Manning [31], more than a decade ago, summarized results from efforts towards learning a grammar without human supervision, while exposing the difficulties of such a task. When it comes to unsupervised learning of specific tasks (for instance syntax parsing, word sense induction, or document translation), some attempts have been made, but none of them can equal the results obtained by their supervised counterparts, which most of the time are limited to one language, or even one context. A more recent appraisal of the general task of language acquisition is offered by Dupoux [19], which stresses using, at most, a limited amount of supervision and 'wild' data, similar to what children are exposed to.

In this context of new approaches, several unsupervised parsing algorithms that make use of Harris' distributional hypothesis [24] are available in the literature [31][15][11]; these leverage some linguistic knowledge (e.g. the part of speech of words) and are therefore not entirely unsupervised. To our knowledge, the state of unsupervised parsing has not improved considerably in the last decade. Deniz Yuret [55] built a truly unsupervised parser based on maximization of mutual information between words, which did not perform as well as supervised approaches, but he pointed out potential improvements. With respect to word embeddings, methods to learn dimensionality-reduced, real-valued vector representations of words in an unsupervised way have been proposed in the past [5] [36, 37] [39]. Algorithms like these would prove useful for our current ULL purposes, as a way to learn representations for words that would allow for word-sense disambiguation and refinement of the input data.

The iterative Unsupervised Language Learning (ULL) pipeline originally introduced by Vepstas and Goertzel [49] brings together generalizations and extensions of previous theories by these and other authors to tackle language learning. Complete descriptions of those theories are cited in their article, but for the purposes of this project we will give a short review of the most relevant ones. Our work makes use of algorithms for parsing and for word representation learning in an unsupervised manner. This chapter presents the theory behind those algorithms.

2.1 Language models

If we want a computer to 'understand' a human language, we first need to find an adequate way to represent it. When it comes to modeling, a natural question that arises is whether a model can actually exist for such a complex process. The idea that a distributional structure is present in both the written evidence of the language and the speaker itself is old and well studied [24]. For example, several experiments show that humans have different levels of association between diverse types of words, which can be measured by comparing their response times after exposure to them [14]. Hence, the question is reduced to: which is the best model given a particular context or purpose?

In the pursuit of an answer to this question, context-free grammars have historically been used to describe the structure of sentences and words in a natural language.
Their name derives from the fact that they use simple replacement rules which are always applicable, independently of the context in which they are found. A lexical system is a language model that describes the syntax or structure of the language in terms of its basic primitives, like words or phonemes. Most lexical systems, and in particular the ones upon which this project builds, are based on context-free grammars.

The first lexical systems using probability to mathematically represent the structures present in language were the n-gram models [29], where an expectation value was assigned to a word given the previous n words in the sequence. These models report fair performance when applied to particular contexts, but are mostly limited to simple sentences. If we want a model to capture more complex relations, such as the ones determined by compound sentences, or word sequences longer than the n-window scope of n-grams, we need a different kind of model. Bengio et al. [5] argue that one needs to learn the joint probability of the words in the sequence for that purpose, and thus face the curse of dimensionality1. They introduce a neural probabilistic model to solve this.

1 This expression was introduced by Richard E. Bellman within the field of dynamic optimization [4]. In the context of machine learning, it refers to various problems, such as data sparsity or computational inefficiency, that derive from trying to analyze data in high-dimensional spaces.

An alternative to the massive use of computational power that complex modeling or neural networks require is the use of independence assumptions. A dependency grammar, for instance, bases its analysis on the direct relation between the linguistic units, e.g. words, and assumes that more complex relations can be built upon these. A parser that is based on a dependency grammar tries to assign a dependency structure to every sentence. A dependency structure is a tree (in the graph sense) composed of planar directed arcs that connect all the words in a sentence, in such a way that each arc connects only two words at a time and no arc passes over the root word [47]. These characteristics of dependency structures make it easier for the parser to gather statistical information about the distribution of words relative to each other.

In the late 1990s, Yuret introduced a formalism based on dependency structures, where the relations between individual pairs of words are the basic primitives of the language [55]. He argued that basing the analysis of a language only on a fixed set of rules does not suffice, because both syntax and semantics are needed to fully understand the language, and the two complement each other. In his words, "what makes humans good learners is not sophisticated learning algorithms but having the right representations" [55]. In his PhD thesis he refers to studies showing that children automatically learn the likelihood of the connection between concepts from their experience with the world, and use this knowledge to break the ambiguity in meaning, even without explicitly knowing any syntax rules. Moreover, he maintains that syntax alone is not enough given the single-dimensionality of language: only a finite (limited) number of different arrangements of words is possible, giving rise to a one-to-many mapping between conceptual relations (meanings) and syntactic relations (orderings of words).
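To make the notion of a dependency structure concrete, the sketch below shows one minimal way such a parse could be stored: the tokenized sentence plus a set of arcs, each linking exactly two word positions. This is our own illustrative simplification, not code from Yuret or from OpenCog; the score field anticipates the lexical attraction measure introduced in the next subsection.

#include <cstddef>
#include <string>
#include <vector>

// One arc between two word positions in a sentence. Positions refer to
// token indices, so a word form that occurs twice is not ambiguous.
struct Arc {
    std::size_t left;   // index of the left word
    std::size_t right;  // index of the right word
    double score;       // e.g. the lexical attraction (MI) of the pair
};

// A dependency parse: the tokenized sentence plus the arcs that connect
// its words into a connected, cycle-free, planar structure.
struct DependencyParse {
    std::vector<std::string> words;
    std::vector<Arc> arcs;
};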
2.1.1 Lexical attraction

For the purpose of this thesis, we chose to model language the same way Yuret did when he presented his first unsupervised parser based on lexical attraction. Lexical attraction models are a class of probabilistic language models founded on the aforementioned formalism, where the language's 'building blocks' are the words. Lexical attraction is defined as the measure of affinity between those building blocks, i.e. "the likelihood that two words will be related in a given sentence" [55]. The concept of lexical attraction is based on the information-theory concepts of entropy and mutual information.

2.1.1.1 Entropy

In his Mathematical Theory of Communication, Shannon [45] proposes the idea of understanding words in a sentence in a natural language as a sequence of tokens which can be represented by a discrete random variable. He defines the entropy of such a variable as the average information content that each token in the language can have. Mathematically, entropy is calculated by the formula:

H = - \sum_i p_i \log(p_i)

where i ranges over the possible values of the random variable and p_i is the probability of the variable taking the value i. Note that it is a weighted average of the 'information content' of each possible event, which in turn is a negative logarithmic function (-\log(p_i)). Consequently, events that are less 'likely' to happen (i.e. have a low p_i) are considered to have more information content, because they are less expected.

2.1.1.2 Mutual Information

Mutual information (MI) is a measure of how much more information content can be extracted from a random variable, given the knowledge of another random variable [17]. It is calculated as2:

I(X, Y) = \sum_{x \in X, y \in Y} p(x, y) \log \frac{p(x, y)}{p(x) p(y)}

Like entropy, mutual information is also a weighted average of a logarithmic function of probabilities, but in this case the function involves two random variables. Thus, mutual information describes the dependency between those two variables. MI can actually be expressed in terms of entropy:

I(X, Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X, Y)

These identities provide a more intuitive and Bayesian way of understanding MI. In particular, they allow us to see that MI is symmetric and non-negative, and that it is equal to zero if and only if the two random variables, X and Y, are independent of each other.

Yuret defined lexical attraction as the likelihood of a syntactic relation [55]. MI compares probabilities and is therefore a measurement of likelihood. In the equation above, if we take X to be the random variable related to one word appearing in a certain position of a sentence, and Y to be the same but for another word and position in the sentence, then the mutual information between the two variables is a measure of how much information content the two words, appearing in that particular order and place, provide to the sentence structure. This particular order and place is exactly what a syntactic relation encloses. In this sense, lexical attraction can be understood as the mutual information captured in the syntactic relation between X and Y.

Moreover, MI measures the average entropy over all the possible outcomes of two random variables. However, in some cases we might be interested in quantifying the information content of an individual pair of events. For example, we may want to know how likely it is that two specific words in a vocabulary are related.
In such cases we are interested in what is called the point-wise mutual information (PMI), also known as fractional mutual information (FMI) [14]:

I(x, y) = \log \frac{p(x, y)}{p(x) p(y)}

Note that I(x, y) here is short for I(X = x, Y = y). Unlike MI, PMI can be negative (I(x, y) < 0); this happens when x and y are in complementary distributions (for example, antonymous words), because p(x, y) is very small compared to p(x)p(y). PMI can also be approximately zero (I(x, y) ≈ 0) if there is no interesting relationship between the two events. In contrast, when there exists a high association between the events x and y, their joint probability p(x, y) will be much larger than the probability of observing them together by chance (p(x)p(y)); hence their PMI will be positive (I(x, y) >> 0). Accordingly, in the previous example, word pairs that frequently appear together will most likely have a high PMI.

2 It is important to comment on the notation used here. In this equation p(x), p(y), and p(x, y) are short for p(X = x), p(Y = y), and p(X = x, Y = y) respectively, where X and Y are different random variables with possibly different sample spaces. This means that even though by symmetry p(X = x, Y = y) = p(Y = y, X = x), it is not necessarily true that p(x, y) = p(y, x). The reason is that, to be consistent, p(y, x) in this context is short for p(X = y, Y = x) and not for p(Y = y, X = x); the first of which may not even be admissible.
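As a small worked illustration with invented counts (not taken from the thesis corpora), suppose that a corpus yields N = 1,000,000 observed word pairs, of which a particular pair (x, y) occurs 200 times, while x occurs in 1,000 pairs and y in 2,000. Using base-2 logarithms:

p(x, y) = \frac{200}{10^6} = 2 \times 10^{-4}, \quad p(x) = 10^{-3}, \quad p(y) = 2 \times 10^{-3}

I(x, y) = \log_2 \frac{2 \times 10^{-4}}{10^{-3} \cdot 2 \times 10^{-3}} = \log_2(100) \approx 6.64 \text{ bits}

The pair is observed a hundred times more often than it would be under independence, so its PMI is strongly positive; pair counts of exactly this kind are what the counting step of the pipeline accumulates.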
2.1.2 Maximal parsing algorithm

Yuret's model of lexical attraction assumes that words with high lexical attraction are likely to be syntactically related within a language. Taking the information-theory approach, the probability of observing a sentence can be computed as the product of the probabilities of observing each word (e.g. a unigram model); but under the assumption of a dependency grammar model, words are not independent of each other. Thus, the probability of a sentence depends as well on the conditional probabilities of the relationships between the words in that sentence. To avoid the complexity of long-term dependencies, Yuret shapes his lexical attraction model as a Markov network, where each word depends on only one other word in the sentence, not necessarily an adjacent one, but is independent of all the others. He then shows that the total entropy of a sentence in such a model is completely determined by the lexical attraction between the dependent words. In this way, because the most probable language model is also the one that produces the lowest entropy, he proves that "the search for a low entropy language model leads to the unsupervised discovery of syntactic relations" [55].

Furthermore, upon this discovery he tackles language understanding through the use of an algorithm for maximal parsing. The goal of the algorithm is to find the dependency structure that assigns a given sentence a high probability based on the lexical attraction. In other words, the algorithm's objective is to find the parse tree (which by definition has no cycles and a planar structure) that assigns the highest total mutual information to the words linked by the tree. He describes an algorithm based on Viterbi that obtains the optimal solution in O(n^5) time, where n is the number of words in a sentence. He also proposes an approximation of his own which exploits the planarity and acyclicity restrictions and has a performance of O(n^2). This approximation algorithm does not always find the most likely parse tree, and in some cases it might leave some words disconnected, but in practice it has a good average performance, and its simplicity of implementation makes it a good candidate for training.
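To make the flavor of such an approach concrete, the sketch below greedily accepts the highest-scoring word pairs of a sentence while rejecting any link that would introduce a cycle or cross an already accepted link. This is our own illustrative code, written under the simplifying assumption that all candidate pairs and their MI scores are given up front; it is neither Yuret's exact left-to-right procedure nor the OpenCog implementation, both of which differ in detail.

#include <algorithm>
#include <cstddef>
#include <vector>

struct Candidate {
    std::size_t left, right; // word positions, with left < right
    double mi;               // lexical attraction (MI) of the pair
};

// Union-find lookup, used to reject links that would close a cycle.
static std::size_t findRoot(std::vector<std::size_t>& parent, std::size_t v) {
    while (parent[v] != v) { parent[v] = parent[parent[v]]; v = parent[v]; }
    return v;
}

// Returns true if the arc (l, r) would cross an already accepted arc.
static bool crosses(const std::vector<Candidate>& accepted,
                    std::size_t l, std::size_t r) {
    for (const Candidate& a : accepted) {
        if ((a.left < l && l < a.right && a.right < r) ||
            (l < a.left && a.left < r && r < a.right))
            return true;
    }
    return false;
}

// Greedy approximation: repeatedly accept the most attractive pair that
// keeps the structure acyclic and planar. Some words may remain unlinked.
std::vector<Candidate> greedyParse(std::size_t numWords,
                                   std::vector<Candidate> candidates) {
    std::sort(candidates.begin(), candidates.end(),
              [](const Candidate& a, const Candidate& b) { return a.mi > b.mi; });

    std::vector<std::size_t> parent(numWords);
    for (std::size_t i = 0; i < numWords; ++i) parent[i] = i;

    std::vector<Candidate> accepted;
    for (const Candidate& c : candidates) {
        if (c.mi <= 0) break;                              // only attractive pairs
        std::size_t rl = findRoot(parent, c.left);
        std::size_t rr = findRoot(parent, c.right);
        if (rl == rr) continue;                            // would create a cycle
        if (crosses(accepted, c.left, c.right)) continue;  // would break planarity
        parent[rl] = rr;
        accepted.push_back(c);
    }
    return accepted;
}

The point of the sketch is the pair of structural checks, not the exact complexity: sorting all candidate pairs already costs more than Yuret's approximation, which reaches its O(n^2) behavior by processing the sentence incrementally.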
2.2 AdaGram: learning disambiguated word representations

The impact caused by the deep learning and neural network revolution has reached language learning as well. By the early 2000s, Bengio et al. [5] introduced a system that learned distributed representations3 for words in an unsupervised way by using their context. Inspired by statistical language models and the distributional hypothesis [24], they obtained better results than the best existing n-gram models of the time, while being more efficient than Latent Semantic Analysis [18] and Latent Dirichlet Allocation [8]. Their proposed neural network was able to learn both word representations and a language model at the same time.

3 A distributed representation is a way of expressing an entity (in this case a word) with "a pattern of activity distributed over many computing elements", where each computing element is involved in the representation of other entities as well [26]. When it comes to word embeddings, computing elements are associated with the dimensions of the vector space, where each dimension is related to a characteristic of the word (part of speech, for instance). Distributed representations are known for their computational efficiency because of their network structure with simple, neuron-like computing elements.

Improvements followed in the coming years. In particular, Mikolov and colleagues' Skip-Gram model [37, 36] grabbed the community's attention because of its ability to predict the context of a word with high computational efficiency, making it suitable for online processing of text. Skip-Gram stood out as well for the unexpected semantic content embedded in the learned vectors: some algebraic operations on word vectors give results similar to analogy tasks. For example, if one subtracts the vector for the word 'man' from the one for 'king', and adds the vector for 'woman', the result is very close to the vector for 'queen'.

A problem not addressed by Skip-Gram is that of ambiguous words. When a word has different meanings that occur in different contexts, its single learned representation has to average over them and ends up dominated by the most common use of the word. Hence, in the best case, a word will be correctly predicted from its dominating context and ignored in the other ones. In a less fortunate situation, the resulting embedding will not be properly predicted for any of the word senses' contexts.

AdaGram (Adaptive Skip-gram) [2] is a neural-network-based continuous representation learning algorithm based on the Skip-Gram model which addresses the aforementioned limitation. It is modified to build a different vector representation for each of a word's significant senses that appear in the training data, while still offering speed of processing and meaningful representations. Various other neural network architectures have been proposed to accommodate ambiguity, but they lack processing efficiency [27], require external knowledge like WordNet senses [13], or allow only a limited/fixed number of senses per word [41, 48].

As part of the ULL pipeline, AdaGram4 will be used for two different purposes: to disambiguate polysemous words, and to learn word representations that can subsequently be used for finding syntactic categories in the language. In this thesis we will tackle only the first purpose. AdaGram takes unlabeled text as input and learns different representations for each word in the vocabulary by training the network using word pairs taken from the context of the target word, following the principles of the distributional hypothesis [24]. Our expectation for this thesis is that the learned representations correspond to useful word senses, but in order to understand how this can happen we need to go through the algorithm. First, we will introduce how Skip-Gram works, and then the modifications included in AdaGram to offer multiple prototypes per word.

4 The repository containing AdaGram can be found at https://github.com/sbos/AdaGram.jl

Figure 2.1: Architecture of the Skip-Gram model. Image credit: Leonardo Barazza [1].

2.2.1 Word embeddings in Skip-Gram

A great deal of the success of Skip-Gram comes from the simplicity of the network used to build word prototypes (as embeddings are regularly called), which allows for a low complexity during training and operation. The full details of the network's operations can be found in the given reference, but its architecture is shown in an example in figure 2.1. The network is composed of a one-hot input layer of dimensionality V that codifies the input word as an index, a fully-connected linear hidden layer of size D, and a hierarchical soft-max output layer with V neurons. V here corresponds to the size of the training vocabulary (10,000 in the example) and D is a hyper-parameter of the model (300 in the example) which determines the reduced dimension (i.e. V >> D) of the learned representations.

For training, each word x_i of the input text X (containing N words) is assigned a context y_i from its C neighboring words (another hyper-parameter) both preceding and following it, all converted to an index in the range [1, V] using a simple look-up table built from the whole corpus. Then, the currently trained word is fed into the network in one-hot encoding format, and one of the context words is used as the training sample to evaluate the loss function; error backpropagation follows. This task is repeated for the rest of the context of this word, and the whole operation for every word in the corpus. This constitutes one training epoch.

Figure 2.2: Training sample selection in the Skip-Gram model. Image credit: Leonardo Barazza [1].

Figure 2.2 illustrates the training sample selection in the case of a context of size 4, where x_i is shown in blue, and y_i is represented by the white rectangles. Notice how the size of the context is smaller for words near the limits of the text. As stated by Bartunov and colleagues [2], Skip-Gram's training objective is to find the parameters θ (the neural network's weights) that maximize the likelihood of the context words Y, given the training words X. Reproducing here their equation 2, that is:

p(Y | X, \theta) = \prod_{i=1}^{N} \prod_{j=1}^{C} p(y_{ij} | x_i, \theta)

The loss function is simply the mean squared difference between the hierarchical soft-max values and the one-hot vector representing the expected context word at each training step.
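The following short sketch shows how such (center, context) training pairs could be enumerated, clipping the window at the beginning and end of the text as in figure 2.2. It is our own illustration of the sampling idea, with the window size C interpreted as the number of neighbors taken on each side; it is not the reference word2vec or AdaGram code.

#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Enumerate (center, context) index pairs for a window of C words on
// each side of every token, clipped at the boundaries of the text.
std::vector<std::pair<std::size_t, std::size_t>>
contextPairs(std::size_t numTokens, std::size_t C) {
    std::vector<std::pair<std::size_t, std::size_t>> pairs;
    for (std::size_t i = 0; i < numTokens; ++i) {
        std::size_t lo = (i >= C) ? i - C : 0;
        std::size_t hi = std::min(numTokens, i + C + 1);
        for (std::size_t j = lo; j < hi; ++j)
            if (j != i) pairs.emplace_back(i, j);  // (x_i, y_ij)
    }
    return pairs;
}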
Once the network is trained, Mikolov et al. use the parameters between the input and hidden layers as continuous vectors to represent each of the words.

2.2.2 Multi-prototype AdaGram

The Adaptive Skip-Gram model of Bartunov et al. introduces modifications to the Skip-Gram architecture, aimed at allowing a single word to potentially have multiple embeddings that refer to its use in different contexts, or senses, thus offering a solution to polysemy. These modifications basically amount to introducing T input neurons per word (a hyper-parameter which specifies the maximum number of meanings a word can have), with their respective connections to the unchanged hidden layer. The output layer is not modified and holds only one neuron per word in the vocabulary, thus being trained on word samples given sense inputs.

Because not every word is expected to have the same number of senses, AdaGram uses a Dirichlet Process allocation of new senses that 'activates' new prototypes based on a prior distribution dictated in the stick-breaking style of Sethuraman [44]. With these modifications, the AdaGram model is described by Bartunov and colleagues' [2] equation:

p(Y, Z, \beta | X, \alpha, \theta) = \prod_{w=1}^{V} \prod_{k=1}^{\infty} p(\beta_{wk} | \alpha) \prod_{i=1}^{N} \left[ p(z_i | x_i, \beta) \prod_{j=1}^{C} p(y_{ij} | z_i, x_i, \theta) \right]

where α is an extra hyper-parameter that, together with the frequency of a word, influences the number of senses that word will have; Z = {z_i}_{i=1}^{N} represents the senses for all words; and β is a Dirichlet Process parameter. Accordingly, the loss function and training updates are modified to accommodate these changes. The reader is advised to review the original article for full details and an in-depth explanation of the network design. It is relevant to mention, though, that the training algorithm's complexity scales only linearly with the hyper-parameter T. Because T represents the maximum number of possible representations for each word, usual values of T are fairly small, so the algorithm maintains its efficiency.

2.3 OpenCog: an open source platform for AGI

Inspired by the architecture and dynamics of a human brain, OpenCog is an open source software framework that offers a diverse collection of cognitive algorithms [38]. Ranging from perception and learning to creativity and socialization, or memory and language, among others, the platform hosts a carefully selected combination of algorithms that tackle different kinds of knowledge. The objective is to imitate the environment and development of young human children by integrating tasks of different complexity involved in all major aspects of human intelligence. Most of its source code is written in C++, but, in an attempt to make it scalable and robust, it also offers Scheme (Guile) and Python bindings.

The mission of the OpenCog project is to achieve Artificial General Intelligence (AGI); that is, developers of the project aim to mimic human-level general intelligence with computers, and hope to one day go beyond it. In the shorter term, the goal is to provide the technological community with "code that can be used (piece by piece or as a whole) to help make practical software applications smarter" [38]. Even though the project is at an early stage of development, its lead researchers5 have a long trajectory in the AI field.
Ben Goertzel6, founder and leader of the OpenCog project, has published a number of books and articles about AGI 5 https://opencog.org/leadership/ 6 https://en.wikipedia.org/wiki/Ben_Goertzel 17 https://opencog.org/leadership/ https://en.wikipedia.org/wiki/Ben_Goertzel 2. Related Work (term which he coined) [22, 52, 21, 23], leads the AI efforts in a handful of compa- nies7, and organizes a yearly Conference on AGI8. What’s more, some of OpenCog’s most robust algorithms are nowadays in use by more than 50 global companies, including Huawei and Cisco [42]. At its core, the OpenCog project consists of a knowledge representation database, the AtomSpace, and a series of assorted projects which are under development. These last include, but are not limited to, natural language processing, attention allocation, spatial and time data managers, embodiment and psychological states modeling, and network server scripts (the cogserver)9. The AtomSpace provides these projects with the work-space and tools necessary for learning and reasoning algorithms to be developed. It comprises an in-RAM database and the associated query/reasoning engine to fetch and manipulate the data on it. Data in the Atom- Space is represented in the form of hyper-graphs10; ergo, the AtomSpace can be seen as a type of graph database in which the query engine is in charge of ‘re-writing’ the graph system and the rule-engine undertakes the general ‘rule-driven inference’. The vertices and edges of such hyper-graphs are called Atoms (hence the name of the database), and can be used to represent not only data information but also pro- cedures. Consequently, many of these hyper-graphs represent executable programs as well as data structures. The ‘programming language’ created to manipulate the atoms and construct the corresponding hyper-graphs is called the Atomese. This language resembles a mixture between several programming languages such as SQL (for query), Prolog (for logic), Lisp (for lambda expressions), and others, because of their utilities. It was not meant to be used by humans, but it was rather designed for machine automation. Atoms can be understood as type instances in type theory or as symbols in many programming languages. They are permanent and immutable, but they can have a value or an interpretation attached to it, which in turn can be mutable. Multi- ple, different values can be associated to a single atom, and each of them can be accessed with a unique key. The triplet (atom, key, value) is called a Valuation. Valuations normally have fleeting, changing values that indicate the truth or like- lihood of the atom they are attached to, but they can also contain other kinds of transient data. There are many different atom types as well, some used for basic knowledge-representation and computer-science concepts. For example there exist atoms representing relations, such as similarity, inheritance and subsets; atoms used for logic, such as Boolean and, or, for-all, there-exists; along with many others. 7 Like singularitynet.io, hansonrobotics.com, mozi.ai, among others. 8 http://agi-conference.org/ 9 Source code at https://github.com/singnet/opencog/ 10 A hyper-graph is a generalization of a graph where each edge can connect more than two vertices at a time. 18 http://agi-conference.org/ https://github.com/singnet/opencog/ 2. Related Work 2.3.1 Representation structures In the context of this thesis there are two atom types that are of particular in- terest: Nodes and Links. 
These two types of atoms form the basis for all of the representation structures that we use for gathering statistics from language text. Definitions follow, along with pieces of code that help to clarify concepts for curious programmer-readers.

Nodes contain a name or label and a type, which cannot be changed after creation. These two properties are the only ones they have, and they form their identification key. As with any other atom, once a node is placed in the AtomSpace it is unique: there can only ever be one node with a given (name, type) pair. If a second Node with the same key were to be inserted in the AtomSpace, any valuations associated with it would be merged with the ones from the previously inserted Node. The following piece of code shows the class definition in C++ for a Node atom.

class Node : public Atom
{
    private:
        std::string name;
    public:
        Node(Type, std::string &);
        const std::string& getName() const;
        std::string toString() const;
};

Links are used to connect or associate together other atoms. Unlike nodes, links do not have any name or label; they are identified uniquely by their type and contents. The content of a link is a set of other atoms, which can be ordered or unordered, and is called the link's outgoing set. The C++ class definition for a Link atom is shown in the following piece of code (the outgoing set is a vector of Handles, i.e., references to other atoms):

class Link : public Atom
{
    private:
        std::vector<Handle> _outgoing;
    public:
        Link(Type, const std::vector<Handle>&);
        const std::vector<Handle>& getOutgoingSet() { return _outgoing; }
        size_t getArity() const { return _outgoing.size(); }
        std::string toString() const;
        std::string toShortString() const;
};

2.3.1.1 Sentence representation

In order to process natural language using the AtomSpace, it is necessary to have ways of representing the sentence structure of a document using Atoms. In OpenCog, the sentence representation follows a hierarchical order, with sentences grouped (through links) into documents, parses grouped into sentences, and word instances grouped into parses¹¹. There are also other association structures, such as word instances corresponding to a word node, or linkage between different word instances (syntactic links), as explained in greater detail in section 3.2.

The notation used to define Atoms is the one for constructing them using the Scheme language. In a generic way it is as follows:

(Type "identifier@UUID" TV)

where Type is the atom type, the identifier is a unique string which also serves as a key handle, UUID is a 128-bit MD5 hash printed in ASCII to make it readable, and TV is a truth value of the atom (a valuation as previously defined). Next, we will introduce the most relevant sentence structures that will be used for statistical gathering within our thesis.

• For documents:
Multiple sentences within a document are connected using a SentenceLink. Each document is saved in a DocumentNode and every sentence is saved in a SentenceNode. There is only one DocumentNode per document and one SentenceNode per sentence.

(SentenceLink
    (SentenceNode "sentence@UUID")
    (DocumentNode "document@UUID")
)

This type of link only indicates that a given sentence is part of the document, but it doesn't indicate where it is located within the document. A SentenceSequenceLink indicates sentence order with a NumberNode. Note that '42' is just an example here.

(SentenceSequenceLink (stv 1.0 1.0)
    (SentenceNode "sentence@UUID")
    (NumberNode "42")
)

• For sentences:
A ParseLink connects parses to their original sentences.
A parse is saved in a ParseNode, which has a SimpleTruthValue associated with it. SimpleTruthValues provide a ranking for such parses as a numerical confidence value v. This is useful because some of the parsers that OpenCog supports can output several parses per sentence, each one with an evaluation, which can be saved as such a confidence value.

11 Documentation available at https://wiki.opencog.org/w/Sentence_representation

(ParseLink
    (ParseNode "sentence@UUID" (stv 1.0 v))
    (SentenceNode "sentence@UUID")
)

• For words:
Each word in a vocabulary is represented with a WordNode, where the identifier is the word itself. Since the same word can appear multiple times within documents or even sentences, a WordInstanceNode is used to represent the unique occurrence of a word in a particular context. The identifier of a WordInstanceNode is used to tell it apart from other instances of the word, and can therefore be tagged with feature data such as tense, number, and part-of-speech. Given a word instance, a ReferenceLink is used to associate it with the underlying word.

(ReferenceLink (stv 1.0 1.0)
    (WordInstanceNode "wordinstance@UUID")
    (WordNode "word")
)

• For parses:
Since different parses may assign different feature data to different word instances, each WordInstanceNode is associated with a particular parse by means of a WordInstanceLink.

(WordInstanceLink (stv 1.0 1.0)
    (WordInstanceNode "wordinstance@UUID")
    (ParseNode "sentence@UUID")
)

As with SentenceLinks, this type of link only indicates that a given word instance is part of the parse, but it doesn't indicate the structure within the parse. To represent the word order within a parse we use a WordSequenceLink with a NumberNode. Note again that '7' is only an example here too.

(WordSequenceLink (stv 1.0 1.0)
    (WordInstanceNode "word@UUID")
    (NumberNode "7")
)

• For word linkages:
As mentioned in section 1.1, Link Grammar (LG) is a sentence-parser that creates undirected, planar link-graphs by connecting pairs of ordered words that follow certain relationship rules. Such rules are defined in a language grammar, where word-pair links are grouped into categories and each link is identified in relation to these categories. LG is integrated into and widely used in OpenCog's projects. Following the same structures used before, LG links are represented in the AtomSpace using an EvaluationLink predicate:

(EvaluationLink (stv 1.0 1.0)
    (LinkGrammarRelationshipNode "Rel-typ")
    (ListLink
        (WordInstanceNode "left-word@UUID")
        (WordInstanceNode "right-word@UUID")
    )
)

Here Rel-typ is the name of the linkage yielded by the LG relation. Also, it uses an ordered list because, even though links are undirected, word order does matter in most cases. Notice that once again word instances are used instead of words, because LG might output more than one valid parse and different parses can have different LG linkages.

With the purpose of illustrating these sentence representations we show an example next. More examples can be found in the wiki page of the project [38]¹².
A text document named example.txt containing only the following line: I saw it here would be processed and saved in the structures bellow: (SentenceLink (SentenceNode "sentenceone@2d19c7e7-2e02-4d5e-9cbe-6772174f3f4d") (DocumentNode "exammpledoc@308a2446-046c-4f00-bf8d-7e4d2f256875")) (SentenceSequenceLink (stv 1 1) (SentenceNode "sentenceone@2d19c7e7-2e02-4d5e-9cbe-6772174f3f4d") (NumberNode "1")) --------------------------------------------------------------------- (ParseLink (stv 1.0 0.83) (ParseNode "sentenceone@2d19c7e7-2e02-4d5e-9cbe-6772174f3f4d_parse_0") (SentenceNode "sentenceone@2d19c7e7-2e02-4d5e-9cbe-6772174f3f4d")) 12 https://wiki.opencog.org/w/Sentence_representation 22 https://wiki.opencog.org/w/Sentence_representation 2. Related Work (WordInstanceLink (stv 1.0 1.0) (WordInstanceNode "I@52c30b4d-5717-47cb-822d-b2caa44f94b9") (ParseNode "sentenceone@2d19c7e7-2e02-4d5e-9cbe-6772174f3f4d_parse_0")) (WordSequenceLink (stv 1.0 1.0) (WordInstanceNode "I@52c30b4d-5717-47cb-822d-b2caa44f94b9") (NumberNode "1")) (ReferenceLink (stv 1.0 1.0) (WordInstanceNode "I@52c30b4d-5717-47cb-822d-b2caa44f94b9") (WordNode "I")) (WordInstanceLink (stv 1.0 1.0) (WordInstanceNode "saw@0f223d17-31a6-49fe-9d37-350d50c53926") (ParseNode "sentenceone@2d19c7e7-2e02-4d5e-9cbe-6772174f3f4d_parse_0")) (WordSequenceLink (stv 1.0 1.0) (WordInstanceNode "saw@0f223d17-31a6-49fe-9d37-350d50c53926") (NumberNode "2")) (ReferenceLink (stv 1.0 1.0) (WordInstanceNode "saw@0f223d17-31a6-49fe-9d37-350d50c53926") (WordNode "see")) (WordInstanceLink (stv 1.0 1.0) (WordInstanceNode "it@f6c2a2a2-232b-4e33-9c93-72f211b475d3") (ParseNode "sentenceone@2d19c7e7-2e02-4d5e-9cbe-6772174f3f4d_parse_0")) (WordSequenceLink (stv 1.0 1.0) (WordInstanceNode "it@f6c2a2a2-232b-4e33-9c93-72f211b475d3") (NumberNode "3")) (ReferenceLink (stv 1.0 1.0) (WordInstanceNode "it@f6c2a2a2-232b-4e33-9c93-72f211b475d3") (WordNode "it")) (WordInstanceLink (stv 1.0 1.0) (WordInstanceNode "here@bf71826c-487e-42df-a941-0ecd3c942a76") (ParseNode "sentenceone@2d19c7e7-2e02-4d5e-9cbe-6772174f3f4d_parse_0")) (WordSequenceLink (stv 1.0 1.0) (WordInstanceNode "here@bf71826c-487e-42df-a941-0ecd3c942a76") (NumberNode "4")) (ReferenceLink (stv 1.0 1.0) (WordInstanceNode "here@bf71826c-487e-42df-a941-0ecd3c942a76") (WordNode "here")) --------------------------------------------------------------------- (ParseLink (stv 1.0 0.17) (ParseNode "sentenceone@2d19c7e7-2e02-4d5e-9cbe-6772174f3f4d_parse_1") (SentenceNode "sentenceone@2d19c7e7-2e02-4d5e-9cbe-6772174f3f4d")) 23 2. 
Related Work (WordInstanceLink (stv 1.0 1.0) (WordInstanceNode "I@3fb606f8-0198-1668-e288-1de81ee1064a") (ParseNode "sentenceone@2d19c7e7-2e02-4d5e-9cbe-6772174f3f4d_parse_0")) (WordSequenceLink (stv 1.0 1.0) (WordInstanceNode "I@3fb606f8-0198-1668-e288-1de81ee1064a") (NumberNode "5")) (ReferenceLink (stv 1.0 1.0) (WordInstanceNode "I@3fb606f8-0198-1668-e288-1de81ee1064a") (WordNode "I")) (WordInstanceLink (stv 1.0 1.0) (WordInstanceNode "saw@2307ab5f-af07-9642-21f6-d04d45e355f2") (ParseNode "sentenceone@2d19c7e7-2e02-4d5e-9cbe-6772174f3f4d_parse_0")) (WordSequenceLink (stv 1.0 1.0) (WordInstanceNode "saw@2307ab5f-af07-9642-21f6-d04d45e355f2") (NumberNode "6")) (ReferenceLink (stv 1.0 1.0) (WordInstanceNode "saw@2307ab5f-af07-9642-21f6-d04d45e355f2") (WordNode "saw")) (WordInstanceLink (stv 1.0 1.0) (WordInstanceNode "it@b7505847-1b7c-82d3-6f5b-ededbeb85fe8") (ParseNode "sentenceone@2d19c7e7-2e02-4d5e-9cbe-6772174f3f4d_parse_0")) (WordSequenceLink (stv 1.0 1.0) (WordInstanceNode "it@b7505847-1b7c-82d3-6f5b-ededbeb85fe8") (NumberNode "7")) (ReferenceLink (stv 1.0 1.0) (WordInstanceNode "it@b7505847-1b7c-82d3-6f5b-ededbeb85fe8") (WordNode "it")) (WordInstanceLink (stv 1.0 1.0) (WordInstanceNode "here@a8f70463-9441-d485-9711-a3d40a85e08f") (ParseNode "sentenceone@2d19c7e7-2e02-4d5e-9cbe-6772174f3f4d_parse_0")) (WordSequenceLink (stv 1.0 1.0) (WordInstanceNode "here@a8f70463-9441-d485-9711-a3d40a85e08f") (NumberNode "8")) (ReferenceLink (stv 1.0 1.0) (WordInstanceNode "here@a8f70463-9441-d485-9711-a3d40a85e08f") (WordNode "here")) 24 3 Methods The ULL pipeline is divided into sub-processes that handle different chores or sub- modules, as explained in Chapter 1; from them, our thesis addresses corpora col- lection, parsing and word sense disambiguation. The task that the whole pipeline wants to tackle is not a simple one: the team building it is not certain if unsu- pervised grammar learning from unannotated corpora with neither grounding nor reinforcement feedback can succeed at all. To make the problem solvable at least to some extent, some simplifications are performed, and some methodologies are pre- ferred over others. A simplification concerning the module addressed in this thesis (i.e. parsing) is that, instead of immediately processing text that resemble every- day language, we collected a series of controlled corpora of increasing complexity. This chapter explains the assumptions made for corpora. It explains as well the methodologies implemented for the parsing and pre-processing learning modules of the ULL pipeline. Finally, it explains the methods that will be used to evaluate the performance of our implementations applied to the aforementioned corpora. 3.1 Controlled corpora Once completed, the ULL pipeline is expected to derive a grammar after process- ing large unannotated corpora. ‘Unannotated’ does not necessarily imply that the texts may not be controlled, at least during the pipeline construction and testing phases. The control that we implement on the learning texts within this thesis project relates to different aspects. One of the aspects refers to the content of the text; for example, we may clean corpus data to contain texts in one language only, as the difficulties of learning more than that would increase substantially. Another aspect relates to the complexity of the texts being inputted. For instance, we may limit the vocabulary size of the text, or consider sentence length and composition as other parameters to vary. 
This allows us to understand the performance of the algorithms in simpler scenarios, isolate potential issues with them from the intricacies inherent to more realistic language, and more easily find ways to overcome them.

Looking back, the first task that we identified in section 1.2 as part of our contributions to the OpenCog ULL project was to find or create adequate text corpora to test our implementations. In this thesis we want our algorithm to learn the structure of a language in a manner resembling that of human children. Recall when a baby starts speaking: the first words are always objects, actions and simple concepts, like 'mom', 'eat' and 'yes'. Afterwards, they learn the relations among these frequently used words, and only later add more complicated sentence structure, like the use of articles and adjectives or word order. Likewise, we elaborated and selected corpora that start with simple structures and limited vocabulary and then gradually increase their complexity. Next we describe in more detail the corpora we will use, in the given order, for testing and evaluating our implementations, as described in sections 3.2, 3.3 and 3.4:

• Proof-of-Concept Turtle (POC-Turtle) corpus - a manually created corpus representing a closed semantic space of 16 different words communicated in the simple Turtle language [19]. It contains a total of 12 sentences with 4 words each, totalling 48 words¹. This language is used in semantic web programming and is limited to three words per sentence in a strict Subject-Verb-Predicate triplet grammar. The complexity of such a language can be thought of as closer to the complexity of language that non-human primates can learn [20], or that children up to the age of 3 can use [21].
Example sentences:
tuna isa fish .
parrot has wing .

• Proof-of-Concept English with no ambiguity (POC-English-NoAmb) corpus - a manually created, closed semantic space of 19 English words with nearly the same frequency of use, communicated in simple grammatical constructions of 4-5 words per sentence. It contains a total of 36 sentences and 206 words, none of which is considered ambiguous for the purposes of this thesis.
Example sentences:
A daughter is a child .
Cake is a food now .

• Proof-of-Concept English with ambiguity (POC-English-Amb) corpus - a manually created, closed semantic space of very few English words (43) with nearly the same frequency of use, communicated in simple grammatical constructions of 4-5 words per sentence. It is similar to the previous corpus (it is actually a superset of it) but contains two words involved in semantically and grammatically ambiguous constructions. Syntactic ambiguity is represented by the word 'saw', which can be either a noun or a verb. The word 'board' is used to exemplify semantic ambiguity, with two different noun meanings (surface to write on vs. group of people directing some activity). It contains a total of 88 sentences and 573 words.
Example sentences:
Mom saw dad with a saw .
Dad is on the board of directors .

1 All corpus and vocabulary sizes are given after the corpus has been pre-processed for our purposes (see section 3.1.1). Words, punctuation marks and any other symbols appearing as separate tokens are considered words.

• Child Directed Speech (CDS) corpus - a sub-collection of the CHILDES corpus [32] containing adults' speech directed to children, characterized by adapted lexicon and reduced grammar complexity.
Specifically, it contains the Brent-Ratner corpus [6, 9] and the UCI-Brent-Syl corpus [10]². It contains a total of 4k unique words in 38k sentences, for a total of 130k words.
Example sentences:
Did you wanna sit down ?
And they all rolled over and one fell out

• Gutenberg Children corpus - a compendium of books for children contained within Project Gutenberg³, following the selection used for the Children's Book Test of the Babi CBT corpus [25]⁴. It contains a vocabulary of 40k words in 207k sentences, with a total size of 2.7M words.
Example sentences:
Alice led the way , and the whole party swam to the shore .
" They're not in the past tense , " retorted aunt Jamesina .

• Brown corpus [20] - a 1-million-word compilation of printed English prose from 1961, covering 15 different categories including Press, Fiction and Humor, among others⁵. After pre-processing it (see section 3.1.1), the corpus contains a total of 482k words distributed in 35k sentences, with a vocabulary size of 43k words. This corpus is considered 'natural' English, and is the most advanced one among our tested corpora.
Example sentences:
Only within the framework of a mature relationship characterized by honest appraisals of performance can we provide telling assistance .
The battle of the Naktong River is just one example of how the battle cry and the spirit of The Fighting Seventh have paid off .

• NLTK's Penn Treebank Sample - a sample from the full Penn Treebank corpus [34], which is available through the NLTK suite [7]. It contains parse trees for 10% of the Wall Street Journal articles in the original corpus. The corpus consists of a vocabulary of 6k words and a total of 2,285 sentences, for a total size of 38k words. Together with the Brown corpus, it represents the most advanced English used in this thesis. The purpose of including a treebank is to use it as a Gold Standard during the evaluation of automatically-generated parses (see section 3.4).
Example sentences:
The problem involves the motion of small magnetic fields within superconductor crystals , limiting their current-carrying capacity .
The Underwood family said that holders of more than a majority of the stock of the company have approved the transaction by written consent .

2 Available at https://childes.talkbank.org/derived/
3 https://www.gutenberg.org
4 https://research.fb.com/downloads/babi/
5 http://www.sls.hawaii.edu/bley-vroman/brown_nolines.txt

3.1.1 Pre-cleaner

So far we have described the nature of the corpora's content. Nevertheless, another important aspect to take into account with regards to corpora control is their format. Collections of text like the ones just described normally come with a variety of symbols and content that are not exactly part of the language structure we are interested in learning, e.g. HTML markup and graphics, variations of the same punctuation marks, UTF codes, capitalized letters, etc. In an ideal scenario, our language learner should be able to also understand the function of that content, but at the current stage of the system it would benefit from using more standardized input. In order to ensure that the input is self-consistent and useful to our purposes, we pre-processed all the input corpora with a text-cleaning algorithm⁶.
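As a rough illustration of the kind of normalization such a cleaner performs, the following minimal Python sketch lower-cases a line, standardizes a few symbols, replaces URLs and numbers with encoding tokens, tokenizes punctuation, and drops over-long sentences and words. This is only a simplified sketch under our own naming conventions, not the actual pre-cleaner; the real tool and its options are described next.

import re

def clean_line(line, max_tokens=25, max_word_len=25):
    # Lower-case and standardize a few symbol variants (hypothetical subset)
    line = line.lower().replace('“', '"').replace('”', '"').replace('’', "'")
    # Replace URLs and numbers with encoding tokens
    line = re.sub(r"https?://\S+", "@url@", line)
    line = re.sub(r"\d+", "@number@", line)
    # Tokenize punctuation marks as separate tokens
    line = re.sub(r"([.,!?;:\"'])", r" \1 ", line)
    tokens = line.split()
    # Drop sentences that are too long or that contain overly long words
    if len(tokens) > max_tokens or any(len(t) > max_word_len for t in tokens):
        return None
    return " ".join(tokens)

print(clean_line("Visit https://www.gutenberg.org NOW!"))  # visit @url@ now !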
The pre-cleaner contains several options for pre-processing the input (which can be turned on and off); they can be grouped into the following broad categories:

• Character content of the text: lower-casing the corpus, standardizing different versions of some symbols (e.g. quote symbols), removing non-allowed characters, removing HTML markup, tokenizing special characters.
• Structure of the text: sentence splitting, header removal, maximum sentence and word lengths allowed.
• Encoding of entities: replacing numbers, URLs, dates and times with encoding tokens (e.g. '@url@'), decoding escaped HTML and Unicode symbols.

For the purposes of this thesis, we pre-process the corpora before parsing using the pre-cleaner default values, which imply the following actions:

• Determine sentence boundaries and split sentences into one per line.
• Remove headers (only for the Gutenberg Children corpus).
• Substitute URL and e-mail addresses with special tokens. Same for dates, times and numbers.
• Decode escaped HTML and Unicode symbols to their printable version.
• Standardize single quotes to one symbol. Same for double quotes, dashes, long dashes.
• Tokenize quotes, double quotes and several other punctuation marks.
• Remove asterisks, underscores, long dashes.
• Remove sentences with more than 25 tokens.
• Remove words with more than 25 characters.
• Convert text to lower-case.

6 Pre-cleaner available at: https://github.com/singnet/language-learning/tree/master/src/pre-cleaner

3.2 Parsing

The grounding for our parsing efforts in the ULL project⁷ is the unsupervised, mutual information-based parser originally proposed by Deniz Yuret [55], which was briefly explained in the previous chapter (section 2.1). The parsing pipeline consists of three stages: gathering frequency information of words and word pairs from the training data, calculating the fractional mutual information (FMI) of those word pairs, and parsing the sentences using the estimated FMI and a Maximum Spanning Tree (MST) algorithm. The counting and parsing stages are independent and consequently can be performed using the same or different sets of training sentences. Each of these stages is explained next.

3.2.1 Frequency counting

Mutual information can be estimated by observing the frequency with which words and word pairs appear in the studied corpora (see section 3.2.2). In order to determine these frequencies, the system includes a text-processing pipeline that scans a text document assuming that each new line represents a new sentence; then it tokenizes the text, which in this case means breaking it into words; and finally it goes through the text again word by word, sentence by sentence, saving the encountered word instances and sentences in related atoms as described in section 2.3⁸. Counting individual words in this manner is simple: it is possible to attach to each WordNode a count truth value (a numeric valuation for an atom, as described in 2.3) that is updated each time the text-processor runs into a corresponding WordInstanceNode. Nevertheless, in this project we are not interested in the appearance of words by themselves but in the presence of words in relation to other words. Thus we need a different kind of structure. Recall that the AtomSpace has a special structure to represent link grammar links.
This structure uses a node with the name of the link type to connect two WordInstanceNodes corresponding to the appearance of a word pair. Since we don't have any link types at this point, and we are interested in registering the repeated presence of word pairs (and not their individual appearance), one possibility is to modify this structure to use a generic link type and use WordNodes instead of WordInstanceNodes. The structure then would be as follows:

(EvaluationLink
    (LinkGrammarRelationshipNode "ANY")
    (ListLink
        (WordNode "left-word")
        (WordNode "right-word")
    )
)

Another option is to create a new link atom where the link type is simply named after its word-pair nature, as follows:

(EvaluationLink
    (PredicateNode "*-Sentence Word Pair-*")
    (ListLink
        (WordNode "left-word")
        (WordNode "right-word")
    )
)

In either case, the presence and frequency (N(x, y) and P(x, y); see section 3.2.2) of these word pairs in the text can be attached to these links as truth values. However, to detect these 'presences' we need to process the text in an observation mode that has a way to determine when the respective counts should be updated. In this thesis project we considered three variations or modes for counting word pairs:

• LG-any: In this mode, for each sentence we sample a limited number of parses using the Link Grammar parser in random parsing mode ('any' mode), and then count the pairs of words linked in these parses. Hence, the structure used for keeping these counts is the first of the two mentioned above. The number of different parses per sentence to consider should be given as a parameter; here we fixed it to 100.

• Clique-WIN: In this mode we use a Cartesian combination of per-sentence words that are within a given window size. That is, per sentence, the observer iterates over each word and pairs it with every other word located within a determined maximum distance to its right. Distance in this context is defined as the difference between the words' positions in the sentence; as such, neighboring words have a distance of 1. The window size, or maximum distance between words in a pair, must be given as a parameter to the observer. For our experiments we fixed it to 6, based on linguistic evidence that (most) syntactic information in a sequence of words lies within this distance [28]. The counts for this kind of pair are held as values in the second type of word-pair structure mentioned above.

• Clique-WIN-dist: In this mode we use a Cartesian combination of per-sentence words that are within a given window size, and account for pair distance. That is, we do the same as in the Clique-WIN method, except that now each word pair is counted a number of times determined by the distance (as defined in Clique-WIN) between the words. In the Clique-WIN method, every time we pair up two words we increase their count by one; in this method we increase it by the quotient WIN/d, where WIN is the window size and d is the actual distance between the words, thus favoring words that appear closer together in a sentence. Once again, the window is set to 6, and the structure used to keep the counts is the second of the ones described above. A small sketch of both clique-counting modes is given below.

7 Code hosted at https://github.com/singnet/opencog/tree/master/opencog/nlp/learn
8 Unless we use LG, only one parse is assigned to each sentence and it is a sequential parse, since we don't have any parsing rules available at that point.
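As a minimal illustration of the two clique-counting modes, the following Python sketch accumulates ordered pair counts in a plain dictionary rather than in AtomSpace structures; the function and variable names are ours and are only meant to mirror the behaviour described above.

from collections import defaultdict

def observe_clique(sentence, pair_counts, window=6, weight_by_distance=False):
    # Count ordered word pairs within the given window, as in Clique-WIN.
    # With weight_by_distance=True, each pair adds window/d instead of 1,
    # mimicking the Clique-WIN-dist mode.
    words = sentence.split()
    for i in range(len(words)):
        for j in range(i + 1, min(i + window, len(words) - 1) + 1):
            d = j - i  # distance between the two words in the sentence
            pair_counts[(words[i], words[j])] += window / d if weight_by_distance else 1

counts = defaultdict(float)
observe_clique("tuna isa fish .", counts, weight_by_distance=True)
print(counts[("tuna", "isa")], counts[("tuna", "fish")])  # 6.0 3.0

The LG-any mode would instead obtain the pairs from randomly sampled Link Grammar parses, which we do not reproduce here.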
3.2.2 Mutual information calculation

Lexical attraction was defined in section 2.1 as the MI held in a syntactic relation. However, because the appearance of a specific word in a particular place in a sentence is not the random variable itself but an instance of it, what Yuret [55] really calculated in his parser was the FMI of the word pairs. Furthermore, we are interested in ordered word pairs, not individual words, so in the context of this thesis the random variables used for defining the FMI of an ordered word pair are as follows:

• X is the random variable of the event of having a word x related to any other word at its right in a sentence. Let * be a wild-card, i.e., any word in the vocabulary of the language; then X represents the event of observing the ordered word pair (x, *).
• Y is the random variable of having a word y related to any other word at its left in a sentence. Similarly, Y represents the event of observing the ordered word pair (*, y).

Moreover, if a word x has another word y to its right, then the word y has word x to its left; and p(X = x, Y = y) = p(Y = y, X = x) is the probability of observing the word pair (x, y) in exactly that order. The random variables X and Y do not have disjoint sample spaces, given that words can be syntactically related to words both at their left and at their right in a sentence. Hence, to avoid confusion, we will use the notation P(x, y) for the probability of observing the word pair (x, y) in a language in that same order⁹. Using this notation and the notion of a wild-card explained before, we can rewrite the equation to calculate the FMI of a word pair (x, y) as:

FMI(x, y) = \log_2 \frac{P(x, y)}{P(x, *) \, P(*, y)}

This implies that p(X = x) = P(x, *) and p(Y = y) = P(*, y). Also notice that P(x, y) and P(y, x) represent two different things.

9 In this sense, P(y, x) is unambiguously short for p(X = y, Y = x).

Considering that it is impossible to know the relative occurrence of ordered word pairs in an entire language, we will use the word and word-pair counts collected from observing texts (as described in the previous section) to estimate it. After running the text-processor over a corpus, the AtomSpace will have gathered statistics and attached them to the created sentence structures. In particular, it will contain the number of times each ordered pair (x, y) was observed in the text, which we will denote N(x, y)¹⁰. Using these, we can estimate the value of P(x, y) as:

P(x, y) \approx \frac{N(x, y)}{N(*, *)}

where N(*, *) is the total number of ordered word-pair observations. In turn, this can be computed as:

N(*, *) = \sum_{x, y} N(x, y)

Lastly, we just need to calculate the marginal probabilities.
These too can be estimated using the word-pair statistics gathered:

P(x, *) = \sum_{y} P(x, y) \approx \frac{N(x, *)}{N(*, *)}

P(*, y) = \sum_{x} P(x, y) \approx \frac{N(*, y)}{N(*, *)}

Here N(x, *) is the total number of times the word x was observed paired with any other word to its right in the text; it can be computed as:

N(x, *) = \sum_{y} N(x, y)

Similarly, N(*, y) is the number of times the word y was observed paired with any other word to its left in the text; it can be computed as:

N(*, y) = \sum_{x} N(x, y)

Substituting the above estimates of the wild-card probabilities, we can estimate the fractional mutual information of an ordered word pair (x, y) as:

FMI(x, y) \approx \log_2 \frac{N(x, y) \, N(*, *)}{N(x, *) \, N(*, y)}

Finally, recall that FMI is a ratio that compares the probability of observing the two words together in a specific order vs. observing them separately, and it therefore lies in the range (-∞, ∞).

10 Note that N(x, y) is not the same as N(y, x) either.

3.2.3 MST algorithm

Once we have a set of ordered word pairs together with an estimation of their point-wise mutual information, we can proceed to parse the text making use of these statistics. As we described in the introduction, the OpenCog project uses an algorithm for maximal parsing, but it is neither of the two illustrated by Yuret's work and mentioned in section 2.1. The reason is that the optimal algorithm has too high a complexity to be useful, while we are willing to sacrifice a little efficiency (i.e., accept something a little worse than O(n²)) in order to get more trustworthy results. Thus, an MST (minimum spanning tree) algorithm was developed within the team, which maximizes the total score of the tree¹¹ instead of minimizing it, and restricts the result to a projective tree as well¹².

When it comes to projective maximal parsing algorithms, the state of the art is Eisner's algorithm, which runs in O(n³) time [35]. Nonetheless, it uses dynamic programming and is not the most straightforward to implement within OpenCog. The simplest implementations of MST algorithms are the three classical ones: Prim's, Kruskal's, and Borůvka's [3, 43]. On the one hand, modifying Prim's algorithm to restrict crossing edges would be simple, but this modification would make the outcome of the algorithm heavily dependent on the initialization, which did not seem very promising¹³. Kruskal's, on the other hand, seems too difficult to combine with a no-link-crossing constraint. Hence, the algorithm we use is an implementation of a variant of Borůvka's algorithm, which seems to be the most robust and sufficiently fast alternative for the project's needs.

Overall, this algorithm takes a sequence of atoms representing an ordered list of words and finds an unlabelled, undirected, projective dependency parse for it. This means that we need a parser method that feeds the algorithm with such a sequence. Therefore, we need to process the input text once more, tokenize each sentence into a list of words, and create a sequence of atoms per sentence from the word list (the atoms used here are of the WordNode type). The parser method then takes each sequence and creates a set of pairs (number . atom), where number corresponds to the position of the respective word in the sentence. Finally, it inputs this set as the set V of vertices of a clique graph G = (V, E)¹⁴, together with the respective set of edges E and a predefined edge scoring function f : E → R, to the modified MST algorithm, which finds a projective tree T in G with maximal total score F = \sum_{e \in T} f(e).
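To make these estimates and scoring functions concrete, the following minimal Python sketch computes FMI values from a dictionary of ordered pair counts (such as the one built in the counting sketch of section 3.2.1) and wraps them into an edge scoring function. This is only an illustration outside the AtomSpace; the function names, the -1e6 penalty and the add_distance_term flag are our own simplifications of the FMI and FMI-dist functions described further below.

import math
from collections import defaultdict

def fmi_table(pair_counts):
    # Estimate FMI(x, y) for every observed ordered pair from raw counts.
    total = sum(pair_counts.values())          # N(*, *)
    left = defaultdict(float)                  # N(x, *)
    right = defaultdict(float)                 # N(*, y)
    for (x, y), n in pair_counts.items():
        left[x] += n
        right[y] += n
    return {(x, y): math.log2(n * total / (left[x] * right[y]))
            for (x, y), n in pair_counts.items()}

def edge_score(fmi, words, i, j, add_distance_term=False, max_dist=16):
    # Score an edge between word positions i < j, mimicking FMI / FMI-dist.
    d = j - i
    pair = (words[i], words[j])
    if d > max_dist or pair not in fmi:
        return -1e6                            # the 'very negative value'
    return fmi[pair] + (1.0 / d if add_distance_term else 0.0)

Scores of this kind are what the modified MST algorithm below maximizes when selecting a projective parse tree.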
Pseudo-code for this modified MST algorithm is given below as Algorithm 1. The algorithm returns the tree as a list of atom pairs, each with its associated score.

11 The total score is the sum of the weights of the edges of the tree, as defined in any classical MST algorithm [43].
12 A projective tree is one such that, if we number the vertices and arrange them in order in a line, the edges of the tree can be drawn above the vertices without crossings [35].
13 The success of the original Prim algorithm lies in the fact that, if there is an optimal solution, the algorithm will find it regardless of which vertex is chosen first. That fact no longer holds if we add the projectiveness restriction.
14 A clique is a graph where there exists an edge between every possible pair of vertices [43].

Algorithm 1: Pseudo-algorithm for the modified MST parser
Input: A clique graph G = (V, E), where V is a linearly ordered set, and an edge scoring function f.
Output: The set L of edges representing the links in the parse.

Take the edge c = (a, b) from E that has the highest score f(c);
Initialize the set L of links assigned to the parse to contain only edge c;
Initialize the set U of disconnected vertices to contain V - {a, b}, that is, all the vertices of the graph except the ones adjacent to edge c;
Initialize the set C of connected vertices to be {a, b}, that is, a set with just the vertices incident to edge c;
while U is not empty do
    Let O be the set of all edges with one endpoint in U and the other in C;
    Initialize the candidate edge v to be nothing;
    while O is not empty AND v is nothing do
        Take the edge x from O that has the highest score f(x);
        if x does not cross any links in L then
            Let v be x;
        else
            Remove x from O;
        end
    end
    if v is nothing then
        Exit the loop;
    else
        Let v = (u, w), where u is in U and w is in C;
        Add v to L;
        Add u to C;
        Remove u from U;
    end
end

The scoring function we input to the algorithm is very important, because this numeric score is the one being maximized during the parse. It should be a function that receives as input an edge e, such that the distance d(e) between its two endpoints is known, and outputs a real number. In the case of our structure, the distance is simple to calculate, given that our vertices are pairs (number . atom). Let v = ((x, a), (y, b)) be an edge that connects word-atom a, which is in position x in the sentence, with a different word-atom b, which is in position y ≠ x in the sentence; then the distance is d(v) = |x - y| > 0. A very simple scoring function could just return that distance, but we are interested in a function that favors links with meaningful relations. Thus, in this thesis we experimented with two scoring functions that use both the distance and the FMI of word pairs:

• FMI: since we are interested in links with high lexical attraction, we use the point-wise mutual information between the two words the edge is trying to connect. We also don't want links that are too long, so we give a scoring penalty to links that have d > 16. Thus, this scoring function returns the FMI for every word pair where it is defined, and a very negative value when it is not defined or when the distance between the words is greater than 16.
• FMI-dist: this scoring function does the same as the one above, except that we account for distance a little further. We don't only want links between words that are no further apart than 16 places from each other, but we also want to favor links among words that are closer together in general.
Thus, we add to the point-wise mutual information of the pair an additional value that is inversely proportional to the distance between the paired words. Specifically, when the FMI is defined and the words are within a distance of 16, the scoring function outputs FMI + 1/d, and a very negative value otherwise.

Finally, although the concept is very easy to understand, it is important to define what it means for an edge to cross any links. Recall that V is linearly ordered, so a relation < among the vertices is defined. Let the edge be e = (j, k) with j < k. For any l ∈ L with endpoints x and y such that x < y, we say that e crosses l if one of the following two conditions holds:

• j < x AND x < k AND k < y
• x < j AND j < y AND y < k

3.3 Word-sense Disambiguation

After the input sentences are organized in parse structures, the ULL pipeline continues by trying