Statistical parsing and unambiguous word representation in OpenCog's Unsupervised Language Learning project
Master's thesis in MPALG and MPCAS
CLAUDIA CASTILLO DOMENECH and ANDRES SUAREZ MADRIGAL
Department of Computer Science and Engineering
CHALMERS UNIVERSITY OF TECHNOLOGY
UNIVERSITY OF GOTHENBURG
Gothenburg, Sweden 2018

Master's thesis 2018

Statistical parsing and unambiguous word representation in OpenCog's Unsupervised Language Learning project
CLAUDIA CASTILLO DOMENECH
ANDRES SUAREZ MADRIGAL
Department of Computer Science and Engineering
Chalmers University of Technology
University of Gothenburg
Gothenburg, Sweden 2018

Statistical parsing and unambiguous word representation in OpenCog's Unsupervised Language Learning project
CLAUDIA CASTILLO DOMENECH
ANDRES SUAREZ MADRIGAL

© CLAUDIA CASTILLO DOMENECH and ANDRES SUAREZ MADRIGAL, 2018.

Supervisor: Luis Nieto Piña, Faculty of Humanities
Advisor: Ben Goertzel, Hanson Robotics Ltd.
Examiner: Morteza Haghir Chehreghani, Data Science division

Master's Thesis 2018
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Typeset in LaTeX
Gothenburg, Sweden 2018

Statistical parsing and unambiguous word representation in OpenCog's Unsupervised Language Learning project
CLAUDIA CASTILLO DOMENECH and ANDRES SUAREZ MADRIGAL
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg

Abstract

The work presented in this thesis is an effort within the larger project entitled "Unsupervised Language Learning", which aims to build a system that can learn the grammar of a language by processing a corpus of unannotated text. In particular, the authors focused on the first steps of the pipeline, including the selection and pre-processing of useful text collections to train the system and test its performance, as well as the syntactic learning loop in which the system obtains statistics from sentences in a given corpus and leverages them to implicitly learn the syntactic structure of the corpus language. The process gathers statistics from the co-occurrence of words in a sentence using different counting methods and estimates the mutual information between word pairs. Using this knowledge and a minimum spanning tree algorithm, the system automatically produces syntactic parses of the ingested sentences. The unsupervised parses, when compared to human-generated or rule-based standards, show varying quality; however, they always perform better than a baseline of randomly generated parses, implying that the system is indeed learning some of the assumed underlying linguistic content from the texts. From the results we learned that the complexity of the input text is a determining factor for which method performs best, leading us to conclude that a successful unsupervised parser should be able, to some extent, to pre-assess this complexity before processing. Also, the outputs of our different parser methods all show that accounting for the distance between the words of a pair when parsing yields better results. Nonetheless, to evaluate this implication more confidently, it is important to have a comparison standard based on the same model assumptions.
Additionally, we implemented a disambiguation process based on AdaGram as a way to build distinct representations for different word senses within a corpus, which then annotates the corpus with tags representing different uses of each word. The purpose of this pre-processing step is to break polysemy in the corpora and provide a cleaner input to the parsing pipeline. We report that our experiments show either a slight improvement in parse quality, or no significant change, if disambiguated corpora are used as input.

Keywords: Unsupervised, language learning, parsing, disambiguation, computer science, engineering, project, thesis.

Acknowledgements

First of all, we want to thank the teams at Hanson Robotics and OpenCog for welcoming us in Hong Kong to contribute to their visionary goals and for sharing their knowledge, time, and friendship with us. In particular, we are grateful to Ben Goertzel for his insights and for making the collaboration between the university and the company viable, and to Anton Kolonin for his constant organization of our efforts and for always being ready to propose a new experiment.

Second, we want to recognize the contribution of our supervisor, Luis Nieto Piña, in making this thesis possible. We greatly benefited from his timely and pertinent advice during the different stages of the project, as well as from his meticulous revisions of our manuscript, even during the rushed last days before submission. In all: ¡Gracias, Luis!

On the personal side, we have also received invaluable aid that we want to acknowledge:

Andrés: I am thankful for the unending support of my parents, brother and sister-in-law. I am also grateful to my friend Iuliia, who was always present during the good and less-good times of the past year.

Claudia: I want to thank my family for their unceasing support during my studies. I am especially grateful to my aunt María Sol, my brother Carlos and his wife Andrea, for not only hosting me in their homes, but also taking care of me, encouraging me and being my strength in the most critical moments.

Gothenburg, October 2018

Contents

List of Figures xi
List of Tables xiii
1 Introduction 1
1.1 The Unsupervised Language Learning Project . . . . . . . . . . . . . 2
1.1.1 Unsupervised Syntactic Analysis . . . . . . . . . . . . . . . . . 3
1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Scope and Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2 Related Work 9
2.1 Language models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.1 Lexical attraction . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1.1.1 Entropy . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1.1.2 Mutual Information . . . . . . . . . . . . . . . . . . 11
2.1.2 Maximal parsing algorithm . . . . . . . . . . . . . . . . . . . . 13
2.2 AdaGram: learning disambiguated word representations . . . . . . . . 13
2.2.1 Word embeddings in Skip-Gram . . . . . . . . . . . . . . . . . 15
2.2.2 Multi-prototype AdaGram . . . . . . . . . . . . . . . . . . . . 16
2.3 OpenCog: an open source platform for AGI . . . . . . . . . . . . . . 17
2.3.1 Representation structures . . . . . . . . . . . . . . . . . . . . 19
2.3.1.1 Sentence representation . . . . . . . . . . . . . . . . 20
3 Methods 25
3.1 Controlled corpora . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.1.1 Pre-cleaner . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2 Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . 29 3.2.1 Frequency counting . . . . . . . . . . . . . . . . . . . . . . . . 29 3.2.2 Mutual information calculation . . . . . . . . . . . . . . . . . 31 3.2.3 MST algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 33 3.3 Word-sense Disambiguation . . . . . . . . . . . . . . . . . . . . . . . 35 3.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.4.1 Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.4.2 Word-sense disambiguation . . . . . . . . . . . . . . . . . . . 41 4 Results 45 4.1 Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 ix Contents 4.1.1 POC-Turtle . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 4.1.2 POC-English without ambiguity . . . . . . . . . . . . . . . . . 49 4.1.3 POC-English with ambiguity . . . . . . . . . . . . . . . . . . 52 4.1.4 Child Directed Speech . . . . . . . . . . . . . . . . . . . . . . 54 4.1.5 Gutenberg Children . . . . . . . . . . . . . . . . . . . . . . . . 58 4.1.6 Brown . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 4.1.7 Penn Treebank . . . . . . . . . . . . . . . . . . . . . . . . . . 63 4.2 Word-sense Disambiguation . . . . . . . . . . . . . . . . . . . . . . . 66 4.2.1 WSD in POC-English . . . . . . . . . . . . . . . . . . . . . . 66 4.2.2 WSD in Gutenberg Children . . . . . . . . . . . . . . . . . . . 68 5 Conclusions 75 5.1 Analysis of results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 5.2 Recommendations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 Bibliography 79 x List of Figures 1.1 Language Learning Project pipeline. . . . . . . . . . . . . . . . . . . . 2 1.2 Syntactic Learning Loop. . . . . . . . . . . . . . . . . . . . . . . . . . 4 2.1 Architecture of the Skip-Grammodel. Image credit: Leonardo Barazza [1]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.2 Training sample selection in the Skip-Gram model. Image credit: Leonardo Barazza [1]. . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 3.1 An example of a random parsing. . . . . . . . . . . . . . . . . . . . . 39 3.2 An example of a sequential parsing. . . . . . . . . . . . . . . . . . . . 39 3.3 F1 score metric: graphic interpretation of precision and recall. Image credit: Walber [51]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 4.1 Example of an incorrect parse. Corpus: POC-Turtle. Method: OpenCog (all 6) against GS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 4.2 Example of an incorrect parse without distance accounting in scoring function. Corpus: POC-Turtle. Method: OpenCog (all 3) with FMI against GS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 4.3 Example of an incorrect parse without distance accounting in the counting method. Corpus: POC-Turtle. Method: OpenCog Clique- WIN (with both scoring functions) against GS. . . . . . . . . . . . . 49 4.4 Example of an incorrect parse in standard. Corpus: POC-English- NoAmb. Method: LG-English (SS) against GS. . . . . . . . . . . . . 50 4.5 Example of a parse with incorrect linking of articles. Corpus: POC- English-NoAmb. Method: OpenCog Clique-WIN + FMI against GS. 51 4.6 Example of a parse with incorrect linking of negation particles and time adverbs. Corpus: POC-English-NoAmb. Method: OpenCog Clique-WIN + FMI against GS. . . . . . . . . . . . . . . . . . . . . . 52 4.7 Example of a parse with compound verbs. Corpus: POC-English- Amb. 
Method: LG-English (SS) against GS. . . . . . . . . . . . . . . 53 4.8 Example of a parse with syntactic ambiguity. Corpus: POC-English- Amb. Method: LG-English (SS) against GS. . . . . . . . . . . . . . . 53 4.9 Example of an incorrect parse. Corpus: POC-English-Amb. Method: OpenCog Clique-WIN + FMI against GS. . . . . . . . . . . . . . . . 54 4.10 Example of a parse excluding words and separating verbs with apos- trophes. Corpus: CDS. Method: OpenCog (all 6) against LG-English (SS). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 xi List of Figures 4.11 Example of a parse that doesn’t process apostrophes. Corpus: CDS. Method: OpenCog Clique-WIN-dist + FMI-dist against LG-English (SS). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 4.12 Example of a parse with different tokenization. Corpus: Guten- berg Children. Method: OpenCog LG-ANY + FMI-dist against LG- English (SS). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 4.13 Example of a parse with inline quotations. Corpus: Gutenberg Chil- dren. Method: OpenCog Clique-WIN-dist + FMI-dist against LG- English (SS). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 4.14 Example of a parse with a compound sentence. Corpus: Brown. Method: OpenCog Clique-WIN-dist (2) and LG-Any (3) against SS (1). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 4.15 Example of a simple parse. Corpus: Penn Treebank. Method: LG- Any with FMI-dist (3) against GS (1) and SS (2). . . . . . . . . . . . 65 4.16 Example of a complex parse. Corpus: Penn Treebank. Method: LG- Any with FMI-dist (3) against GS (1) and SS (2). . . . . . . . . . . . 65 4.17 Examples and results for Model 1 and 2, trained for WSD in the POC-English corpus. Ambiguous word senses, according to the GS, are colored in the example sentences. . . . . . . . . . . . . . . . . . . 68 4.18 Linear effects for selected hyper-parameters on PWFS score for the Gutenberg Children corpus. . . . . . . . . . . . . . . . . . . . . . . . 70 xii List of Tables 4.1 Parsing results for POC-Turtle. . . . . . . . . . . . . . . . . . . . . . 46 4.2 Example of pair FMI calculation in POC-Turtle: word wing. . . . . . 47 4.3 Example of pair FMI calculation in POC-Turtle: word eagle. . . . . . 48 4.4 Example of pair FMI calculation in POC-Turtle: word herring. . . . . 49 4.5 Parsing results for POC-English without ambiguity. . . . . . . . . . . 50 4.6 Example of pair FMI calculation in POC-Turtle: verb + article. . . . 51 4.7 Parsing results for POC-English with ambiguity. . . . . . . . . . . . . 53 4.8 Parsing results for Child Directed Speech. . . . . . . . . . . . . . . . 55 4.9 Parsing results for Gutenberg Children. . . . . . . . . . . . . . . . . . 58 4.10 Parsing results for Brown. . . . . . . . . . . . . . . . . . . . . . . . . 63 4.11 Parsing results for Penn Treebank. . . . . . . . . . . . . . . . . . . . 64 4.12 Hyper-parameter effects on WSD scores for EnglishPOC. . . . . . . . 67 4.13 Parameters and scores for the best disambiguated POC-English models. 68 4.14 Parsing results for POC-English after disambiguation. . . . . . . . . . 69 4.15 Parameters and scores for the best disambiguated Gutenberg Chil- dren model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 4.16 Spearman’s ρ correlation for different similarity scores for our best disambiguated Gutenberg Children model. Results for all models but Model 1 are taken from Huang et al. [27] . . . . . . . . . . . . . . . . 
71
4.17 Parsing results for Gutenberg Children after disambiguation . . . . . . 73

1 Introduction

Language is perhaps the single most important feat that humankind has achieved in history. Unparalleled by any other species on Earth, early human verbal communication offered a more efficient way to transmit ideas and abstract thoughts, and to share experiences with others, thus accelerating learning and the progress of societies throughout the world. Current-day machines have their own internal languages, also invented by humans, and they are quite efficient with them. However, those languages differ considerably from natural languages (languages created by humans through evolution, without intentional design), to the point that only a small percentage of the population is fluent in them.

Natural Language Processing (NLP) is a multidisciplinary area of research that strives to create systems that are able to communicate externally through the use of human languages [29]. Using human language to express instructions to a computer and obtain its results would greatly enhance its ease of use. This is a non-trivial task, and although we already have decent speech recognition, sequential translation, and question-answering systems (just to mention a few applications), we still cannot make programs that fully comprehend language. Some even consider it an AI-complete problem: a problem that requires self-conscious systems, or 'true AI', to be solved [54].

Traditionally, approaches to tackle Natural Language Processing have depended on a human-in-the-loop, either to write sets of explicit rules that allow a system to disentangle the different parts of a language input, or to annotate text corpora for the system to learn from. The complexity of human language makes a set of rules or manual guidelines unlikely to cover all possibilities the language can offer, let alone when the semantic layer is to be taken into account, which seems necessary for a system to completely understand a language. Moreover, learning from text corpora in a supervised manner requires lots of data to be pre-processed and labeled before training, and unfortunately most of the annotation efforts have been centered on English (and a few other widely spoken languages), which means that these methods will not be as effective in most other languages. Therefore, an unsupervised learning system that could derive meaning and learn any language's grammar (the structural rules present in that language) from unannotated corpora appears to be an adequate approach to solving the problem.

1.1 The Unsupervised Language Learning Project

The purpose of the Unsupervised Language Learning (ULL) Project1, developed by researchers at Hanson Robotics2 and the OpenCog Project3, is to implement the NLP pipeline proposed by Vepstas and Goertzel [49], aiming to achieve the understanding of any language by training on unannotated text corpora. When completed, the process is expected to learn a language grammar in an unsupervised way, a crucial tool needed for an NLP system to understand that language. The pipeline is based on a built-in linguistic model framework that assumes the existence of grammatical dependency, as well as syntactic and semantic structure, in the language. The different layers of the model exploit and learn such components of the language. Figure 1.1 shows a high-level diagram of the described learning process.

Figure 1.1: Language Learning Project pipeline.
In the diagram we can see how the units of the proposed pipeline interact in a recursive manner to learn the structures of each of the three language components, and finally obtain logic induction from them, all starting from raw text. In particular, the syntactic analysis processes receive feedback after the semantic analysis is performed. At the same time, learning the context of these structures can provide some tools for refined abstractions. The objective is for these components to be able to achieve a higher generalization of the structures present in the language. As the authors of the proposed pipeline state, "this can be viewed as a form of 'deep learning' at work: a higher, more abstract layer can serve to refine the correctness of a shallower, more concrete layer" [49].

After enough iterations, these three units of the language learning process should be capable of deriving "a system of relationships that display appropriate coherence and that, when applied by an appropriate parser to the inputs from the previous layer, will yield parse trees that reflect the information content of the input" [49]. By 'enough iterations' we mean enough to reach a level of abstraction that allows comparison and interaction with a model of the physical world, so that the language model serves to represent it. Certainly, training all of this in an unsupervised manner requires a large corpus of text, but once trained, the system will be able to understand previously unseen text by using the learned language structures and applying logic induction.

1 https://wiki.opencog.org/w/Language_learning
2 www.hansonrobotics.com
3 www.opencog.org

1.1.1 Unsupervised Syntactic Analysis

As one can see in the diagram (Figure 1.1), the first component of the learning process is the syntactic analysis. Although some implicit semantic analysis will be performed as well, the main focus of this thesis is centered on this first component; to highlight the difference, we have used a different color for it than for the other two components in the diagram. The next section discusses the topic of this dissertation in detail, but before getting into the specifics, we provide a short overview of the original authors' ideas about how to handle the extraction of syntax.

In order to understand the particulars of the learning process, it is necessary to first understand the authors' general approach behind all of the component learning loops. The high-level algorithm of the pipeline can be described by the following steps:

A. Define words to be 'objects'.
B. Look for correlations between objects.
C. Express the correlations in a convenient representation.
D. Cluster similar objects together into classes.
E. Define a new set of objects as the clusters obtained from the last step.
F. Repeat from B until a certain level of abstraction is obtained.
G. Use the definitions to apply logical induction on unseen data.

Observe that the 'feedback' to bootstrap the language components depends heavily on steps B, C, and D. In the context of syntactic analysis, finding correlations (step B) is exactly what a parser does. Thus, as the first steps in the ULL pipeline, the system should be able to observe sentences from an unannotated corpus and statistically induce a grammar from them, then obtain dependency parses with untagged labels between their components.
These parses can subsequently be fed as input to the following learning layers, or be processed further to provide feedback to the syntactic component itself. In this way, recursion can be done at different levels, by either using the feedback from the same component, or by taking in refinements from other components when redefining the 'new objects' accordingly.

Figure 1.2 shows how this recursion could be done in detail for the syntactic component. In a first pass of the currently proposed approach, raw input text is tokenized and separated into words, which will be our first 'objects'. Then, Mutual Information (MI) measurements are used to decide how likely it is for words to appear together in a sentence (as an induction of a rough initial grammar). Finally, a Minimum Spanning Tree (MST) algorithm uses this rough grammar based on word-pair MI to build a dependency parse tree for a sentence. In later passes of the algorithm, these same settings can be used to process the newly defined 'objects'4. Alternatively, we could use a grammar derived from our own method in further steps and parse the input in a dictionary-based manner.

4 For example, the new objects can be word senses after disambiguation; or they can be words tagged with the categories obtained from a previous pass through the layer of semantic analysis.

Figure 1.2: Syntactic Learning Loop.

Once the first steps are able to produce acceptable parses from raw text in an unsupervised fashion, they will provide a database of dependent word pairs representing an enormous amount of highly fine-grained information regarding the syntactic rules of the language. The next natural step is organizing these relations to derive an improved structure of the learned language, and producing from it a proper grammar that could be comparable to hand-crafted rules like those used by, e.g., the Link Grammar [46]5. This is equivalent to performing step D of the general algorithm; but in order to do that, we first need to pre-process these parses or correlations in an appropriate way to allow for further organization and clustering. In other words, we need to perform step C first.

5 Link Grammar (LG) is a theory of syntax and morphology originally proposed by Davy Temperley and Daniel Sleator (1991), and contributed to by several developers. It parses a sentence by directly connecting pairs of words instead of generating a phrase structure hierarchy like constituency grammars do. It differs from a dependency grammar in that links between words lack a head-dependent relationship, i.e. they are undirected. Further details as well as the parser code are available at the main website: https://www.abisource.com/projects/link-grammar/

A proper representation of our objects is therefore needed. Taking into consideration the information contained in the parses and the way most existing clustering methods work, a vectorization of our objects seems appropriate. This is because having objects (words, word pairs or any other linguistic unit, in a first pass of the pipeline) represented in a geometric space makes future comparisons and calculations simpler than having to deal with categorical objects. A possible approach is to use dimensionality reduction based on context estimation: utilize a word embedding process that learns to represent concepts, in this case our objects, as vectors in a continuous multidimensional space6.
This approach is convenient because most natural languages have on average more than 100,000 words in their vocabulary, and dealing with such high dimensionalities would be complex and time-consuming, not to mention computationally expensive. Thus, using these smaller object vectors, it is possible to proceed with the next step: clustering of words into groups that resemble lexical categories.

6 It is expected that this type of object representation will implicitly learn some meaning related to the objects. This is the only part of the syntactic loop that deals with semantics without receiving feedback from the semantic loop.

The idea with clustering is to reduce the amount of highly fine-grained syntactic information obtained from the parser by extracting common patterns from it that can be turned into a grammar. A suitable approach in early iterations would be to reduce the number of rules obtained from the parser by grouping together words that behave in a similar way, i.e. that frequently appear linked to the same word(s). The purpose of this categorization of words is to learn a grouping that could coarsely correspond to part-of-speech (POS) classes. In later iterations, when dealing with higher-level abstractions, the same grouping strategy could learn semantic relations that could again be fed back to the parsing and categorization system, and perhaps learn idioms and set phrases. Finally, the grammar rules obtained after clustering can be used to re-process the raw text, this time using a rule-based parser, or they can be fed as input to the following analysis component loops.

1.2 Contributions

The current thesis project strives to add to the first steps of this large NLP research effort by, as mentioned before, concentrating on the syntactic loop only. So far, we have presented a short summary of the proposal as it was originally stated by its authors [49]7. It is important to note that the different pipeline components presented so far follow thoroughly-studied hypotheses that have been previously presented and partially tested by other authors, but had not been connected together to form a complete pipeline as attempted by the current project. Certainly, much has changed since the date of publication of the authors' first article, at the academic level and in relation to certain parts of the pipeline; and now we can use these advances to contribute to its development even further. Here we would like to broadly describe the actual implementations that our thesis project contributed; but before we can do that, we need to provide a quick description of the development conditions of the general project before our participation.

7 For an alternative description of the project, check out the source code's Readme file at https://github.com/opencog/opencog/tree/master/opencog/nlp/learn

When we joined the OpenCog project to work on the current thesis, there was already an alpha version of the parser with some experimental results, but it was not suitable for our purposes8. First, there was a description (Readme file and Wiki page) of the general project, but no thorough documentation for each function and script existed, so every new person interested in collaborating with the project had to read through the entire source code to understand what each component did.
Second, the installation process of the pipeline on a new computer was complex, and there were some mistakes in the output of the parser in random runs that appeared to be related to this installation process. Third, the results available from experiments conducted with this alpha version of the parser were not properly organized, and it was unclear which specific corpora had been used as input in each case. Also, at least some of the results were obtained using Wikipedia articles as the input text, which did not provide good results: the factual, informative nature of these articles brings little variety in pronouns or action verbs, and they contain large numbers of proper names, foreign-language words, dates, tables, etc., which hinder the learning process and can lead to unusual deductions of grammar.

Given the development state of the project, it seemed like a logical way to contribute to start by gathering useful corpora, implementing a pre-processor of data, revising and amending the state of the alpha parser, and possibly working with word-sense disambiguation before parsing as part of the syntactic analysis loop. With this in mind, we identified the need for five specific tasks to be developed:

1. Find or create adequate text corpora to test the efforts of the project.
2. Create a configurable pre-processor to simplify these corpora as needed.
3. Understand, modify and test the existing OpenCog-based parser to make it useful to the project.
4. Integrate word disambiguation and representation learning algorithms (as part of pre-processing).
5. Write proper documentation for the three previous tasks.

In particular, the question we aim to answer in this thesis project, by performing the described tasks, is whether it is possible to learn the syntactic structures of a language in an unsupervised manner and get results that are at least as good as those of existing supervised methods. We will try to achieve this by assuming that each language has an innate dependency structure between word pairs, and we aim to learn this structure from observing a large amount of text. As part of the ULL project, our ideas will be based on the authors' original proposal, but we will use our knowledge of the latest developments in the field to try to advance even further. Technically speaking, the goals of this thesis project are to improve, combine and modify existing algorithms (which may be used for other purposes) to work as a part of the proposed pipeline. Based on the results, some extensions to this project are proposed in the Conclusions chapter. Finally, one challenge we aim for is making these modifications using graphs and tree-like structures in such a way that the resulting algorithm can be applied to numerous other fields outside NLP.

8 One of the project contributors experimented with the other two parts of the syntactic loop as well, but none of his results at that moment looked like something we could build upon.

1.3 Scope and Limitations

This thesis project deals with a problem that is part of a bigger and rather ambitious research effort: solving unsupervised language understanding. If such a research effort succeeded, the conceptual contributions to the scientific community would be immensely valuable. Nowadays, developing an unsupervised language parser for any language would already be a big milestone.
We considered that solving the problem of obtaining simple grammar and syntax structures, and finding a proper representation for them to improve a later grouping into lexical categories, is a key piece necessary before experimenting with recursion. Achieving favorable results on this topic would be equivalent to overcoming the next obstacle, allowing for further development in more abstract layers of language understanding. Thus, this is the current focus of the OpenCog team: experimenting with one forward pass through the syntactic loop only.

Moreover, as part of the OpenCog team research project, the results of this thesis will be limited to one pass of the pipeline as well. It is important to keep in mind that, according to the authors' proposal, refinements in learning are obtained through recursion, so without recursion the outcome of a first pass through the pipeline is expected to be of limited performance. For this reason, we will also limit our evaluation standards, with the hope that once recursion is implemented in the future, our method will manage to surpass not only the state of the art of unsupervised methods but supervised methods as well. It is also relevant to mention that the scope of our thesis does not include the embedding or clustering sections9. Therefore, we found experimentation with word-sense disambiguation before parsing to be a convenient way of evaluating how the parsing outputs can vary if the input to the pipeline (our 'objects') gets refined.

9 Other members of the OpenCog team are working on these sections in parallel with us. Unfortunately, because of time limitations, it will not be possible to publish their results or use them for comparison within this thesis.

Finally, because we will not be able to harness the power of recursion during a single pass of the learning algorithm, we decided to simplify the input data through a pre-processing step, in order to better understand whether the implemented ideas work in a simple case, thus assessing their performance. Because of time restrictions, we decided to focus on texts that have different complexity levels but that belong to the same language. Since the most extensive supervised NLP efforts have been carried out in English, we restricted our experiments to English corpora as well, so that we have a benchmark for comparison10.

10 One of our collaborators has tested the pipeline with Chinese, but his scripts require the text corpora to first be processed by another algorithm that learns to split the text into words before being input to the pipeline, and this effort is outside of our thesis domain.

Following the previous explanations, the structure of this thesis is as follows:

• Chapter 2 introduces the related work relevant to our topic. It specifically explains the two main theories that our thesis builds on, which are Yuret's unsupervised parser [55] and AdaGram word embeddings [2]. It also gives an overview of the OpenCog project.
• Chapter 3 presents the ways we will implement the ideas taken from the related work in the OpenCog environment and what tools we will use to evaluate our results. This chapter also offers a description of the text corpora we selected.
• Chapter 4 presents the results obtained from our experiments, mainly comparison charts and useful output examples.
• Chapter 5 summarizes our findings, gives recommendations of what could be improved, and suggests ways in which this project could be continued in the future.
2 Related Work

The amount of research involving methods for natural language processing systems is extensive. Nonetheless, as mentioned before, most of the existing approaches depend heavily on the input of human experts in the field. Klein and Manning [31], more than a decade ago, summarized results from efforts towards learning a grammar without human supervision, while exposing the difficulties of such a task. When it comes to unsupervised learning of specific tasks (for instance syntax parsing, word sense induction, or document translation), some attempts have been made, but none of them can equal the results obtained by their supervised counterparts, which most of the time are limited to one language, or even one context. A more recent appraisal of the general task of language acquisition is offered by Dupoux [19], which stresses using, at most, a limited amount of supervision and 'wild' data, similar to what children are exposed to.

In this context of new approaches, several unsupervised parsing algorithms that make use of Harris' distributional hypothesis [24] are available in the literature [31][15][11]; these leverage some linguistic knowledge (e.g. the part of speech of words) and are therefore not entirely unsupervised. To our knowledge, the state of unsupervised parsing has not improved considerably in the last decade. Deniz Yuret [55] built a truly unsupervised parser based on maximization of mutual information between words, which did not perform as well as supervised approaches, but he pointed out potential improvements. With respect to word embeddings, methods to learn dimensionality-reduced, real-valued vector representations of words in an unsupervised way have been proposed in the past [5] [36, 37] [39]. Algorithms like these would prove useful for our current ULL purposes, as a way to learn representations for words that would allow for word-sense disambiguation and refinement of the input data.

The iterative Unsupervised Language Learning (ULL) pipeline originally introduced by Vepstas and Goertzel [49] brings together generalizations and extensions of previous theories by these and other authors to tackle language learning. Complete descriptions of those theories are cited in their article, but for the purposes of this project we will give a short review of the most relevant ones. Our work makes use of algorithms for parsing and for word representation learning in an unsupervised manner. This chapter presents the theory behind those algorithms.

2.1 Language models

If we want a computer to 'understand' a human language, we first need to find an adequate way to represent it. When it comes to modeling, a natural question that arises is whether a model can actually exist for such a complex process. The idea that a distributional structure is present in both the written evidence of the language and the speaker itself is old and well studied [24]. For example, several experiments show that humans have different levels of association between diverse types of words, which can be measured by comparing their response times after exposure to them [14]. Hence, the question is reduced to: which is the best model given a particular context or purpose?

In the pursuit of an answer to this question, context-free grammars have historically been used to describe the structure of sentences and words in a natural language.
Their name derives from the fact that they use simple replacement rules which are always applicable, independently of the context in which they are found. A lexical system is a language model that describes the syntax or structure of the language in terms of its basic primitives, like words or phonemes. Most lexical systems, and in particular the ones upon which this project builds, are based on context-free grammars.

The first lexical systems using probability to mathematically represent the structures present in language were the n-gram models [29], where an expectation value was assigned to a word given the previous n words in the sequence. These models report fair performance when applied to particular contexts, but are mostly limited to simple sentences. If we want a model to capture more complex relations, such as the ones determined by compound sentences, or word sequences longer than the n-window scope of n-grams, we need a different kind of model. Bengio et al. [5] argue that one needs to learn the joint probability of the words in the sequence for that purpose, and thus face the curse of dimensionality1. They introduce a neural probabilistic model to solve this.

1 This expression was introduced by Richard E. Bellman within the field of dynamic optimization [4]. In the context of machine learning, it refers to various problems, such as data sparsity or computational inefficiency, that derive from trying to analyze data in high-dimensional spaces.

An alternative to the massive use of computational power that complex modeling or neural networks require is the use of independence assumptions. A dependency grammar, for instance, bases its analysis on the direct relation between the linguistic units, e.g. words, and assumes that more complex relations can be built upon these. A parser that is based on a dependency grammar tries to assign a dependency structure to every sentence. A dependency structure is a tree (in the graph sense) composed of planar directed arcs that connect all the words in a sentence, in such a way that each arc connects only two words at a time and no arc passes over the root word [47]. These characteristics of dependency structures make it easier for the parser to gather statistical information about the distribution of words relative to each other.

In the late 1990s, Yuret introduced a formalism based on dependency structures, where the relations between individual pairs of words are the basic primitives of the language [55]. He argued that basing the analysis of a language only on a fixed set of rules does not suffice, because both syntax and semantics are needed to fully understand the language, and the two complement each other. In his words, "what makes humans good learners is not sophisticated learning algorithms but having the right representations" [55]. In his PhD thesis he refers to studies showing that children automatically learn the likelihood of the connection between concepts from their experience with the world, and use this knowledge to break the ambiguity in meaning, even without explicitly knowing any syntax rules. Moreover, he maintains that syntax alone is not enough given the single-dimensionality of language: only a finite (limited) number of different arrangements of words is possible, giving rise to a one-to-many mapping between conceptual relations (meanings) and syntactic relations (orderings of words).
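To make the notion of a dependency structure concrete, the sketch below shows one minimal way such a parse could be stored: the tokenized sentence plus a set of arcs, each linking exactly two word positions. This is our own illustrative simplification, not code from Yuret or from OpenCog; the score field anticipates the lexical attraction measure introduced in the next subsection.

#include <cstddef>
#include <string>
#include <vector>

// One arc between two word positions in a sentence. Positions refer to
// token indices, so a word form that occurs twice is not ambiguous.
struct Arc {
    std::size_t left;   // index of the left word
    std::size_t right;  // index of the right word
    double score;       // e.g. the lexical attraction (MI) of the pair
};

// A dependency parse: the tokenized sentence plus the arcs that connect
// its words into a connected, cycle-free, planar structure.
struct DependencyParse {
    std::vector<std::string> words;
    std::vector<Arc> arcs;
};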
2.1.1 Lexical attraction

For the purpose of this thesis, we chose to model language the same way Yuret did when he presented his first unsupervised parser based on lexical attraction. Lexical attraction models are a class of probabilistic language models founded on the aforementioned formalism, where the language's 'building blocks' are the words. Lexical attraction is defined as the measure of affinity between those building blocks, i.e. "the likelihood that two words will be related in a given sentence" [55]. The concept of lexical attraction is based on the information-theory concepts of entropy and mutual information.

2.1.1.1 Entropy

In his Mathematical Theory of Communication, Shannon [45] proposes the idea of understanding words in a sentence in a natural language as a sequence of tokens which can be represented by a discrete random variable. He defines the entropy of such a variable as the average information content that each token in the language can have. Mathematically, entropy is calculated by the formula:

H = - \sum_i p_i \log(p_i)

where i ranges over the possible values of the random variable and p_i is the probability of the variable taking the value i. Note that it is a weighted average of the 'information content' of each possible event, which in turn is a negative logarithmic function (-\log(p_i)). Consequently, events that are less 'likely' to happen (i.e. have a low p_i) are considered to have more information content, because they are less expected.

2.1.1.2 Mutual Information

Mutual information (MI) is a measure of how much more information content can be extracted from a random variable, given the knowledge of another random variable [17]. It is calculated as2:

I(X, Y) = \sum_{x \in X, y \in Y} p(x, y) \log \frac{p(x, y)}{p(x) p(y)}

Like entropy, mutual information is also a weighted average of a logarithmic function of probabilities, but in this case the function involves two random variables. Thus, mutual information describes the dependency between those two variables. MI can actually be expressed in terms of entropy:

I(X, Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X, Y)

These identities provide a more intuitive and Bayesian way of understanding MI. In particular, they allow us to see that MI is symmetric and non-negative, and that it is equal to zero if and only if the two random variables, X and Y, are independent of each other.

Yuret defined lexical attraction as the likelihood of a syntactic relation [55]. MI compares probabilities and is therefore a measurement of likelihood. In the equation above, if we take X to be the random variable related to one word appearing in a certain position of a sentence, and Y to be the same but for another word and position in the sentence, then the mutual information between the two variables is a measure of how much information content the two words, appearing in that particular order and place, provide to the sentence structure. This particular order and place is exactly what a syntactic relation encloses. In this sense, lexical attraction can be understood as the mutual information captured in the syntactic relation between X and Y.

Moreover, MI measures the average entropy over all the possible outcomes of two random variables. However, in some cases we might be interested in quantifying the information content of an individual pair of events. For example, we may want to know how likely it is that two specific words in a vocabulary are related.
In such cases we are interested in what is called the point-wise mutual information (PMI), also known as fractional mutual information (FMI) [14]:

I(x, y) = \log \frac{p(x, y)}{p(x) p(y)}

Note that I(x, y) here is short for I(X = x, Y = y). Unlike MI, PMI can be negative (I(x, y) < 0); this happens when x and y are in complementary distributions (for example, antonymous words), because p(x, y) is very small compared to p(x)p(y). PMI can also be approximately zero (I(x, y) ≈ 0) if there is no interesting relationship between the two events. In contrast, when there exists a high association between the events x and y, their joint probability p(x, y) will be much larger than the probability of observing them together by chance (p(x)p(y)); hence their PMI will be positive (I(x, y) >> 0). Accordingly, in the previous example, word pairs that frequently appear together will most likely have a high PMI.

2 It is important to comment on the notation used here. In this equation p(x), p(y), and p(x, y) are short for p(X = x), p(Y = y), and p(X = x, Y = y) respectively, where X and Y are different random variables with possibly different sample spaces. This means that even though by symmetry p(X = x, Y = y) = p(Y = y, X = x), it is not necessarily true that p(x, y) = p(y, x). The reason is that, to be consistent, p(y, x) in this context is short for p(X = y, Y = x) and not for p(Y = y, X = x); the first of which may not even be admissible.
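As a small worked illustration with invented counts (not taken from the thesis corpora), suppose that a corpus yields N = 1,000,000 observed word pairs, of which a particular pair (x, y) occurs 200 times, while x occurs in 1,000 pairs and y in 2,000. Using base-2 logarithms:

p(x, y) = \frac{200}{10^6} = 2 \times 10^{-4}, \quad p(x) = 10^{-3}, \quad p(y) = 2 \times 10^{-3}

I(x, y) = \log_2 \frac{2 \times 10^{-4}}{10^{-3} \cdot 2 \times 10^{-3}} = \log_2(100) \approx 6.64 \text{ bits}

The pair is observed a hundred times more often than it would be under independence, so its PMI is strongly positive; pair counts of exactly this kind are what the counting step of the pipeline accumulates.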
2.1.2 Maximal parsing algorithm

Yuret's model of lexical attraction assumes that words with high lexical attraction are likely to be syntactically related within a language. Taking the information-theory approach, the probability of observing a sentence can be computed as the product of the probabilities of observing each word (e.g. a unigram model); but under the assumption of a dependency grammar model, words are not independent of each other. Thus, the probability of a sentence depends as well on the conditional probabilities of the relationships between the words in that sentence. To avoid the complexity of long-term dependencies, Yuret shapes his lexical attraction model as a Markov network, where each word depends on only one other word in the sentence, not necessarily an adjacent one, but is independent of all the others. He then shows that the total entropy of a sentence in such a model is completely determined by the lexical attraction between the dependent words. In this way, because the most probable language model is also the one that produces the lowest entropy, he proves that "the search for a low entropy language model leads to the unsupervised discovery of syntactic relations" [55].

Furthermore, upon this discovery he tackles language understanding through the use of an algorithm for maximal parsing. The goal of the algorithm is to find the dependency structure that assigns a given sentence a high probability based on the lexical attraction. In other words, the algorithm's objective is to find the parse tree (which by definition has no cycles and a planar structure) that assigns the highest total mutual information to the words linked by the tree. He describes an algorithm based on Viterbi that obtains the optimal solution in O(n^5) time, where n is the number of words in a sentence. He also proposes an approximation of his own which exploits the planarity and acyclicity restrictions and has a performance of O(n^2). This approximation algorithm does not always find the most likely parse tree, and in some cases it might leave some words disconnected, but in practice it has a good average performance, and its simplicity of implementation makes it a good candidate for training.
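To make the flavor of such an approach concrete, the sketch below greedily accepts the highest-scoring word pairs of a sentence while rejecting any link that would introduce a cycle or cross an already accepted link. This is our own illustrative code, written under the simplifying assumption that all candidate pairs and their MI scores are given up front; it is neither Yuret's exact left-to-right procedure nor the OpenCog implementation, both of which differ in detail.

#include <algorithm>
#include <cstddef>
#include <vector>

struct Candidate {
    std::size_t left, right; // word positions, with left < right
    double mi;               // lexical attraction (MI) of the pair
};

// Union-find lookup, used to reject links that would close a cycle.
static std::size_t findRoot(std::vector<std::size_t>& parent, std::size_t v) {
    while (parent[v] != v) { parent[v] = parent[parent[v]]; v = parent[v]; }
    return v;
}

// Returns true if the arc (l, r) would cross an already accepted arc.
static bool crosses(const std::vector<Candidate>& accepted,
                    std::size_t l, std::size_t r) {
    for (const Candidate& a : accepted) {
        if ((a.left < l && l < a.right && a.right < r) ||
            (l < a.left && a.left < r && r < a.right))
            return true;
    }
    return false;
}

// Greedy approximation: repeatedly accept the most attractive pair that
// keeps the structure acyclic and planar. Some words may remain unlinked.
std::vector<Candidate> greedyParse(std::size_t numWords,
                                   std::vector<Candidate> candidates) {
    std::sort(candidates.begin(), candidates.end(),
              [](const Candidate& a, const Candidate& b) { return a.mi > b.mi; });

    std::vector<std::size_t> parent(numWords);
    for (std::size_t i = 0; i < numWords; ++i) parent[i] = i;

    std::vector<Candidate> accepted;
    for (const Candidate& c : candidates) {
        if (c.mi <= 0) break;                              // only attractive pairs
        std::size_t rl = findRoot(parent, c.left);
        std::size_t rr = findRoot(parent, c.right);
        if (rl == rr) continue;                            // would create a cycle
        if (crosses(accepted, c.left, c.right)) continue;  // would break planarity
        parent[rl] = rr;
        accepted.push_back(c);
    }
    return accepted;
}

The point of the sketch is the pair of structural checks, not the exact complexity: sorting all candidate pairs already costs more than Yuret's approximation, which reaches its O(n^2) behavior by processing the sentence incrementally.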
2.2 AdaGram: learning disambiguated word representations

The impact caused by the deep learning and neural network revolution has reached language learning as well. By the early 2000s, Bengio et al. [5] introduced a system that learned distributed representations3 for words in an unsupervised way by using their context. Inspired by statistical language models and the distributional hypothesis [24], they obtained better results than the best existing n-gram models of the time, while being more efficient than Latent Semantic Analysis [18] and Latent Dirichlet Allocation [8]. Their proposed neural network was able to learn both word representations and a language model at the same time.

3 A distributed representation is a way of expressing an entity (in this case a word) with "a pattern of activity distributed over many computing elements", where each computing element is involved in the representation of other entities as well [26]. When it comes to word embeddings, computing elements are associated with the dimensions of the vector space, where each dimension is related to a characteristic of the word (part of speech, for instance). Distributed representations are known for their computational efficiency because of their network structure with simple, neuron-like computing elements.

Improvements followed in the coming years. In particular, Mikolov and colleagues' Skip-Gram model [37, 36] grabbed the community's attention because of its ability to predict the context of a word with high computational efficiency, making it suitable for online processing of text. Skip-Gram stood out as well for the unexpected semantic content embedded in the learned vectors: some algebraic operations on word vectors give results similar to analogy tasks. For example, if one subtracts the vector for the word 'man' from the one for 'king', and adds the vector for 'woman', the result is very close to the vector for 'queen'.

A problem not addressed by Skip-Gram is that of ambiguous words. When a word has different meanings that occur in different contexts, its single learned representation has to average over them and ends up dominated by the most common use of the word. Hence, in the best case, a word will be correctly predicted from its dominating context and ignored in the other ones. In a less fortunate situation, the resulting embedding will not be properly predicted for any of the word senses' contexts.

AdaGram (Adaptive Skip-gram) [2] is a neural-network-based continuous representation learning algorithm based on the Skip-Gram model which addresses the aforementioned limitation. It is modified to build a different vector representation for each of a word's significant senses that appear in the training data, while still offering speed of processing and meaningful representations. Various other neural network architectures have been proposed to accommodate ambiguity, but they lack processing efficiency [27], require external knowledge like WordNet senses [13], or allow only a limited/fixed number of senses per word [41, 48].

As part of the ULL pipeline, AdaGram4 will be used for two different purposes: to disambiguate polysemous words, and to learn word representations that can subsequently be used for finding syntactic categories in the language. In this thesis we will tackle only the first purpose. AdaGram takes unlabeled text as input and learns different representations for each word in the vocabulary by training the network using word pairs taken from the context of the target word, following the principles of the distributional hypothesis [24]. Our expectation for this thesis is that the learned representations correspond to useful word senses, but in order to understand how this can happen we need to go through the algorithm. First, we will introduce how Skip-Gram works, and then the modifications included in AdaGram to offer multiple prototypes per word.

4 The repository containing AdaGram can be found at https://github.com/sbos/AdaGram.jl

Figure 2.1: Architecture of the Skip-Gram model. Image credit: Leonardo Barazza [1].

2.2.1 Word embeddings in Skip-Gram

A great deal of the success of Skip-Gram comes from the simplicity of the network used to build word prototypes (as embeddings are regularly called), which allows for a low complexity during training and operation. The full details of the network's operations can be found in the given reference, but its architecture is shown in an example in figure 2.1. The network is composed of a one-hot input layer of dimensionality V that codifies the input word as an index, a fully-connected linear hidden layer of size D, and a hierarchical soft-max output layer with V neurons. V here corresponds to the size of the training vocabulary (10,000 in the example) and D is a hyper-parameter of the model (300 in the example) which determines the reduced dimension (i.e. V >> D) of the learned representations.

For training, each word x_i of the input text X (containing N words) is assigned a context y_i from its C neighboring words (another hyper-parameter) both preceding and following it, all converted to an index in the range [1, V] using a simple look-up table built from the whole corpus. Then, the currently trained word is fed into the network in one-hot encoding format, and one of the context words is used as the training sample to evaluate the loss function; error backpropagation follows. This task is repeated for the rest of the context of this word, and the whole operation for every word in the corpus. This constitutes one training epoch.

Figure 2.2: Training sample selection in the Skip-Gram model. Image credit: Leonardo Barazza [1].

Figure 2.2 illustrates the training sample selection in the case of a context of size 4, where x_i is shown in blue, and y_i is represented by the white rectangles. Notice how the size of the context is smaller for words near the limits of the text. As stated by Bartunov and colleagues [2], Skip-Gram's training objective is to find the parameters θ (the neural network's weights) that maximize the likelihood of the context words Y, given the training words X. Reproducing here their equation 2, that is:

p(Y | X, \theta) = \prod_{i=1}^{N} \prod_{j=1}^{C} p(y_{ij} | x_i, \theta)

The loss function is simply the mean squared difference between the hierarchical soft-max values and the one-hot vector representing the expected context word at each training step.
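The following short sketch shows how such (center, context) training pairs could be enumerated, clipping the window at the beginning and end of the text as in figure 2.2. It is our own illustration of the sampling idea, with the window size C interpreted as the number of neighbors taken on each side; it is not the reference word2vec or AdaGram code.

#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Enumerate (center, context) index pairs for a window of C words on
// each side of every token, clipped at the boundaries of the text.
std::vector<std::pair<std::size_t, std::size_t>>
contextPairs(std::size_t numTokens, std::size_t C) {
    std::vector<std::pair<std::size_t, std::size_t>> pairs;
    for (std::size_t i = 0; i < numTokens; ++i) {
        std::size_t lo = (i >= C) ? i - C : 0;
        std::size_t hi = std::min(numTokens, i + C + 1);
        for (std::size_t j = lo; j < hi; ++j)
            if (j != i) pairs.emplace_back(i, j);  // (x_i, y_ij)
    }
    return pairs;
}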
Once the network is trained, Mikolov et al. use the parameters between the input and hidden layers as continuous vectors to represent each of the words.

2.2.2 Multi-prototype AdaGram

The Adaptive Skip-Gram model of Bartunov et al. introduces modifications to the Skip-Gram architecture, aimed at allowing a single word to potentially have multiple embeddings that refer to its use in different contexts, or senses, thus offering a solution to polysemy. These modifications basically amount to introducing T input neurons per word (a hyper-parameter which specifies the maximum number of meanings a word can have), with their respective connections to the unchanged hidden layer. The output layer is not modified and holds only one neuron per word in the vocabulary, thus being trained on word samples given sense inputs.

Because not every word is expected to have the same number of senses, AdaGram uses a Dirichlet Process allocation of new senses that 'activates' new prototypes based on a prior distribution dictated in the stick-breaking style of Sethuraman [44]. With these modifications, the AdaGram model is described by Bartunov and colleagues' [2] equation:

p(Y, Z, \beta | X, \alpha, \theta) = \prod_{w=1}^{V} \prod_{k=1}^{\infty} p(\beta_{wk} | \alpha) \prod_{i=1}^{N} \left[ p(z_i | x_i, \beta) \prod_{j=1}^{C} p(y_{ij} | z_i, x_i, \theta) \right]

where α is an extra hyper-parameter that, together with the frequency of a word, influences the number of senses that word will have; Z = {z_i}_{i=1}^{N} represents the senses for all words; and β is a Dirichlet Process parameter. Accordingly, the loss function and training updates are modified to accommodate these changes. The reader is advised to review the original article for full details and an in-depth explanation of the network design. It is relevant to mention, though, that the training algorithm's complexity scales only linearly with the hyper-parameter T. Because T represents the maximum number of possible representations for each word, usual values of T are fairly small, so the algorithm maintains its efficiency.

2.3 OpenCog: an open source platform for AGI

Inspired by the architecture and dynamics of a human brain, OpenCog is an open source software framework that offers a diverse collection of cognitive algorithms [38]. Ranging from perception and learning to creativity and socialization, or memory and language, among others, the platform hosts a carefully selected combination of algorithms that tackle different kinds of knowledge. The objective is to imitate the environment and development of young human children by integrating tasks of different complexity involved in all major aspects of human intelligence. Most of its source code is written in C++, but, in an attempt to make it scalable and robust, it also offers Scheme (Guile) and Python bindings.

The mission of the OpenCog project is to achieve Artificial General Intelligence (AGI); that is, developers of the project aim to mimic human-level general intelligence with computers, and hope to one day go beyond it. In the shorter term, the goal is to provide the technological community with "code that can be used (piece by piece or as a whole) to help make practical software applications smarter" [38]. Even though the project is at an early stage of development, its lead researchers5 have a long trajectory in the AI field.
Ben Goertzel6, founder and leader of the OpenCog project, has published a number of books and articles about AGI 5 https://opencog.org/leadership/ 6 https://en.wikipedia.org/wiki/Ben_Goertzel 17 https://opencog.org/leadership/ https://en.wikipedia.org/wiki/Ben_Goertzel 2. Related Work (term which he coined) [22, 52, 21, 23], leads the AI efforts in a handful of compa- nies7, and organizes a yearly Conference on AGI8. What’s more, some of OpenCog’s most robust algorithms are nowadays in use by more than 50 global companies, including Huawei and Cisco [42]. At its core, the OpenCog project consists of a knowledge representation database, the AtomSpace, and a series of assorted projects which are under development. These last include, but are not limited to, natural language processing, attention allocation, spatial and time data managers, embodiment and psychological states modeling, and network server scripts (the cogserver)9. The AtomSpace provides these projects with the work-space and tools necessary for learning and reasoning algorithms to be developed. It comprises an in-RAM database and the associated query/reasoning engine to fetch and manipulate the data on it. Data in the Atom- Space is represented in the form of hyper-graphs10; ergo, the AtomSpace can be seen as a type of graph database in which the query engine is in charge of ‘re-writing’ the graph system and the rule-engine undertakes the general ‘rule-driven inference’. The vertices and edges of such hyper-graphs are called Atoms (hence the name of the database), and can be used to represent not only data information but also pro- cedures. Consequently, many of these hyper-graphs represent executable programs as well as data structures. The ‘programming language’ created to manipulate the atoms and construct the corresponding hyper-graphs is called the Atomese. This language resembles a mixture between several programming languages such as SQL (for query), Prolog (for logic), Lisp (for lambda expressions), and others, because of their utilities. It was not meant to be used by humans, but it was rather designed for machine automation. Atoms can be understood as type instances in type theory or as symbols in many programming languages. They are permanent and immutable, but they can have a value or an interpretation attached to it, which in turn can be mutable. Multi- ple, different values can be associated to a single atom, and each of them can be accessed with a unique key. The triplet (atom, key, value) is called a Valuation. Valuations normally have fleeting, changing values that indicate the truth or like- lihood of the atom they are attached to, but they can also contain other kinds of transient data. There are many different atom types as well, some used for basic knowledge-representation and computer-science concepts. For example there exist atoms representing relations, such as similarity, inheritance and subsets; atoms used for logic, such as Boolean and, or, for-all, there-exists; along with many others. 7 Like singularitynet.io, hansonrobotics.com, mozi.ai, among others. 8 http://agi-conference.org/ 9 Source code at https://github.com/singnet/opencog/ 10 A hyper-graph is a generalization of a graph where each edge can connect more than two vertices at a time. 18 http://agi-conference.org/ https://github.com/singnet/opencog/ 2. Related Work 2.3.1 Representation structures In the context of this thesis there are two atom types that are of particular in- terest: Nodes and Links. 
These two types of atoms form the basis for all of the representation structures that we use for gathering statistics from language text. Definitions follow, along with pieces of code that help to clarify concepts for curious programmer-readers.

Nodes contain a name or label and a type, which cannot be changed after creation. These two properties are the only ones they have, and they form their identification key. As with any other atom, once a node is placed in the AtomSpace it is unique: there can only ever be one node with a given (name, type) pair. If a second Node with the same key were to be inserted in the AtomSpace, any valuations associated with it would be merged with the ones from the previously inserted Node. The following piece of code shows the class definition in C++ for a Node atom.

class Node : public Atom
{
    private:
        std::string name;
    public:
        Node(Type, std::string &);
        const std::string& getName() const;
        std::string toString() const;
};

Links are used to connect or associate together other atoms. Unlike nodes, links do not have any name or label; they are identified uniquely by their type and contents. The content of a link is a set of other atoms, which can be ordered or unordered, and is called the link's outgoing set. The C++ class definition for a Link atom is shown in the following piece of code (the outgoing set is a vector of Handles, i.e., references to other atoms):

class Link : public Atom
{
    private:
        std::vector<Handle> _outgoing;
    public:
        Link(Type, const std::vector<Handle>&);
        const std::vector<Handle>& getOutgoingSet() { return _outgoing; }
        size_t getArity() const { return _outgoing.size(); }
        std::string toString() const;
        std::string toShortString() const;
};

2.3.1.1 Sentence representation

In order to process natural language using the AtomSpace, it is necessary to have ways of representing the sentence structure of a document using Atoms. In OpenCog, the sentence representation follows a hierarchical order, with sentences grouped (through links) into documents, parses grouped into sentences, and word instances grouped into parses¹¹. There are also other association structures, such as word instances corresponding to a word node, or linkage between different word instances (syntactic links), as explained in greater detail in section 3.2.

The notation used to define Atoms is the one for constructing them using the Scheme language. In a generic way it is as follows:

(Type "identifier@UUID" TV)

where Type is the atom type, the identifier is a unique string which also serves as a key handle, UUID is a 128-bit MD5 hash printed in ASCII to make it readable, and TV is a truth value of the atom (a valuation as previously defined). Next, we will introduce the most relevant sentence structures that will be used for statistical gathering within our thesis.

• For documents:
Multiple sentences within a document are connected using a SentenceLink. Each document is saved in a DocumentNode and every sentence is saved in a SentenceNode. There is only one DocumentNode per document and one SentenceNode per sentence.

(SentenceLink
    (SentenceNode "sentence@UUID")
    (DocumentNode "document@UUID")
)

This type of link only indicates that a given sentence is part of the document, but it doesn't indicate where it is located within the document. A SentenceSequenceLink indicates sentence order with a NumberNode. Note that '42' is just an example here.

(SentenceSequenceLink (stv 1.0 1.0)
    (SentenceNode "sentence@UUID")
    (NumberNode "42")
)

• For sentences:
A ParseLink connects parses to their original sentences.
A parse is saved in a ParseNode, which has a SimpleTruthValue associated with it. SimpleTruthValues provide a ranking for such parses as a numerical confidence value v. This is useful because some of the parsers that OpenCog supports can output several parses per sentence, each one with an evaluation, which can be saved as such a confidence value.

11 Documentation available at https://wiki.opencog.org/w/Sentence_representation

(ParseLink
    (ParseNode "sentence@UUID" (stv 1.0 v))
    (SentenceNode "sentence@UUID")
)

• For words:
Each word in a vocabulary is represented with a WordNode, where the identifier is the word itself. Since the same word can appear multiple times within documents or even sentences, a WordInstanceNode is used to represent the unique occurrence of a word in a particular context. The identifier of a WordInstanceNode is used to tell it apart from other instances of the word, and can therefore be tagged with feature data such as tense, number, and part-of-speech. Given a word instance, a ReferenceLink is used to associate it with the underlying word.

(ReferenceLink (stv 1.0 1.0)
    (WordInstanceNode "wordinstance@UUID")
    (WordNode "word")
)

• For parses:
Since different parses may assign different feature data to different word instances, each WordInstanceNode is associated with a particular parse by means of a WordInstanceLink.

(WordInstanceLink (stv 1.0 1.0)
    (WordInstanceNode "wordinstance@UUID")
    (ParseNode "sentence@UUID")
)

As with SentenceLinks, this type of link only indicates that a given word instance is part of the parse, but it doesn't indicate the structure within the parse. To represent the word order within a parse we use a WordSequenceLink with a NumberNode. Note again that '7' is only an example here too.

(WordSequenceLink (stv 1.0 1.0)
    (WordInstanceNode "word@UUID")
    (NumberNode "7")
)

• For word linkages:
As mentioned in section 1.1, Link Grammar (LG) is a sentence-parser that creates undirected, planar link-graphs by connecting pairs of ordered words that follow certain relationship rules. Such rules are defined in a language grammar, where word-pair links are grouped into categories and each link is identified in relation to these categories. LG is integrated into and widely used in OpenCog's projects. Following the same structures used before, LG links are represented in the AtomSpace using an EvaluationLink predicate:

(EvaluationLink (stv 1.0 1.0)
    (LinkGrammarRelationshipNode "Rel-typ")
    (ListLink
        (WordInstanceNode "left-word@UUID")
        (WordInstanceNode "right-word@UUID")
    )
)

Here Rel-typ is the name of the linkage yielded by the LG relation. Also, it uses an ordered list because, even though links are undirected, word order does matter in most cases. Notice that once again word instances are used instead of words, because LG might output more than one valid parse and different parses can have different LG linkages.

With the purpose of illustrating these sentence representations we show an example next. More examples can be found in the wiki page of the project [38]¹².
A text document named example.txt containing only the following line: I saw it here would be processed and saved in the structures bellow: (SentenceLink (SentenceNode "sentenceone@2d19c7e7-2e02-4d5e-9cbe-6772174f3f4d") (DocumentNode "exammpledoc@308a2446-046c-4f00-bf8d-7e4d2f256875")) (SentenceSequenceLink (stv 1 1) (SentenceNode "sentenceone@2d19c7e7-2e02-4d5e-9cbe-6772174f3f4d") (NumberNode "1")) --------------------------------------------------------------------- (ParseLink (stv 1.0 0.83) (ParseNode "sentenceone@2d19c7e7-2e02-4d5e-9cbe-6772174f3f4d_parse_0") (SentenceNode "sentenceone@2d19c7e7-2e02-4d5e-9cbe-6772174f3f4d")) 12 https://wiki.opencog.org/w/Sentence_representation 22 https://wiki.opencog.org/w/Sentence_representation 2. Related Work (WordInstanceLink (stv 1.0 1.0) (WordInstanceNode "I@52c30b4d-5717-47cb-822d-b2caa44f94b9") (ParseNode "sentenceone@2d19c7e7-2e02-4d5e-9cbe-6772174f3f4d_parse_0")) (WordSequenceLink (stv 1.0 1.0) (WordInstanceNode "I@52c30b4d-5717-47cb-822d-b2caa44f94b9") (NumberNode "1")) (ReferenceLink (stv 1.0 1.0) (WordInstanceNode "I@52c30b4d-5717-47cb-822d-b2caa44f94b9") (WordNode "I")) (WordInstanceLink (stv 1.0 1.0) (WordInstanceNode "saw@0f223d17-31a6-49fe-9d37-350d50c53926") (ParseNode "sentenceone@2d19c7e7-2e02-4d5e-9cbe-6772174f3f4d_parse_0")) (WordSequenceLink (stv 1.0 1.0) (WordInstanceNode "saw@0f223d17-31a6-49fe-9d37-350d50c53926") (NumberNode "2")) (ReferenceLink (stv 1.0 1.0) (WordInstanceNode "saw@0f223d17-31a6-49fe-9d37-350d50c53926") (WordNode "see")) (WordInstanceLink (stv 1.0 1.0) (WordInstanceNode "it@f6c2a2a2-232b-4e33-9c93-72f211b475d3") (ParseNode "sentenceone@2d19c7e7-2e02-4d5e-9cbe-6772174f3f4d_parse_0")) (WordSequenceLink (stv 1.0 1.0) (WordInstanceNode "it@f6c2a2a2-232b-4e33-9c93-72f211b475d3") (NumberNode "3")) (ReferenceLink (stv 1.0 1.0) (WordInstanceNode "it@f6c2a2a2-232b-4e33-9c93-72f211b475d3") (WordNode "it")) (WordInstanceLink (stv 1.0 1.0) (WordInstanceNode "here@bf71826c-487e-42df-a941-0ecd3c942a76") (ParseNode "sentenceone@2d19c7e7-2e02-4d5e-9cbe-6772174f3f4d_parse_0")) (WordSequenceLink (stv 1.0 1.0) (WordInstanceNode "here@bf71826c-487e-42df-a941-0ecd3c942a76") (NumberNode "4")) (ReferenceLink (stv 1.0 1.0) (WordInstanceNode "here@bf71826c-487e-42df-a941-0ecd3c942a76") (WordNode "here")) --------------------------------------------------------------------- (ParseLink (stv 1.0 0.17) (ParseNode "sentenceone@2d19c7e7-2e02-4d5e-9cbe-6772174f3f4d_parse_1") (SentenceNode "sentenceone@2d19c7e7-2e02-4d5e-9cbe-6772174f3f4d")) 23 2. 
Related Work (WordInstanceLink (stv 1.0 1.0) (WordInstanceNode "I@3fb606f8-0198-1668-e288-1de81ee1064a") (ParseNode "sentenceone@2d19c7e7-2e02-4d5e-9cbe-6772174f3f4d_parse_0")) (WordSequenceLink (stv 1.0 1.0) (WordInstanceNode "I@3fb606f8-0198-1668-e288-1de81ee1064a") (NumberNode "5")) (ReferenceLink (stv 1.0 1.0) (WordInstanceNode "I@3fb606f8-0198-1668-e288-1de81ee1064a") (WordNode "I")) (WordInstanceLink (stv 1.0 1.0) (WordInstanceNode "saw@2307ab5f-af07-9642-21f6-d04d45e355f2") (ParseNode "sentenceone@2d19c7e7-2e02-4d5e-9cbe-6772174f3f4d_parse_0")) (WordSequenceLink (stv 1.0 1.0) (WordInstanceNode "saw@2307ab5f-af07-9642-21f6-d04d45e355f2") (NumberNode "6")) (ReferenceLink (stv 1.0 1.0) (WordInstanceNode "saw@2307ab5f-af07-9642-21f6-d04d45e355f2") (WordNode "saw")) (WordInstanceLink (stv 1.0 1.0) (WordInstanceNode "it@b7505847-1b7c-82d3-6f5b-ededbeb85fe8") (ParseNode "sentenceone@2d19c7e7-2e02-4d5e-9cbe-6772174f3f4d_parse_0")) (WordSequenceLink (stv 1.0 1.0) (WordInstanceNode "it@b7505847-1b7c-82d3-6f5b-ededbeb85fe8") (NumberNode "7")) (ReferenceLink (stv 1.0 1.0) (WordInstanceNode "it@b7505847-1b7c-82d3-6f5b-ededbeb85fe8") (WordNode "it")) (WordInstanceLink (stv 1.0 1.0) (WordInstanceNode "here@a8f70463-9441-d485-9711-a3d40a85e08f") (ParseNode "sentenceone@2d19c7e7-2e02-4d5e-9cbe-6772174f3f4d_parse_0")) (WordSequenceLink (stv 1.0 1.0) (WordInstanceNode "here@a8f70463-9441-d485-9711-a3d40a85e08f") (NumberNode "8")) (ReferenceLink (stv 1.0 1.0) (WordInstanceNode "here@a8f70463-9441-d485-9711-a3d40a85e08f") (WordNode "here")) 24 3 Methods The ULL pipeline is divided into sub-processes that handle different chores or sub- modules, as explained in Chapter 1; from them, our thesis addresses corpora col- lection, parsing and word sense disambiguation. The task that the whole pipeline wants to tackle is not a simple one: the team building it is not certain if unsu- pervised grammar learning from unannotated corpora with neither grounding nor reinforcement feedback can succeed at all. To make the problem solvable at least to some extent, some simplifications are performed, and some methodologies are pre- ferred over others. A simplification concerning the module addressed in this thesis (i.e. parsing) is that, instead of immediately processing text that resemble every- day language, we collected a series of controlled corpora of increasing complexity. This chapter explains the assumptions made for corpora. It explains as well the methodologies implemented for the parsing and pre-processing learning modules of the ULL pipeline. Finally, it explains the methods that will be used to evaluate the performance of our implementations applied to the aforementioned corpora. 3.1 Controlled corpora Once completed, the ULL pipeline is expected to derive a grammar after process- ing large unannotated corpora. ‘Unannotated’ does not necessarily imply that the texts may not be controlled, at least during the pipeline construction and testing phases. The control that we implement on the learning texts within this thesis project relates to different aspects. One of the aspects refers to the content of the text; for example, we may clean corpus data to contain texts in one language only, as the difficulties of learning more than that would increase substantially. Another aspect relates to the complexity of the texts being inputted. For instance, we may limit the vocabulary size of the text, or consider sentence length and composition as other parameters to vary. 
This allows us to understand the performance of the algorithms in simpler scenarios, isolate potential issues with them from the intricacies inherent to more realistic language, and more easily find ways to overcome them.

Looking back, the first task that we identified in section 1.2 as part of our contributions to the OpenCog ULL project was to find or create adequate text corpora to test our implementations. In this thesis we want our algorithm to learn the structure of a language in a manner resembling that of human children. Recall when a baby starts speaking: the first words are always objects, actions and simple concepts, like 'mom', 'eat' and 'yes'. Afterwards, they learn the relations among these frequently used words, and only later add more complicated sentence structure, like the use of articles and adjectives or word order. Likewise, we elaborated and selected corpora that start with simple structures and limited vocabulary and then gradually increase their complexity. Next we describe in more detail the corpora we will use, in the given order, for testing and evaluating our implementations, as described in sections 3.2, 3.3 and 3.4:

• Proof-of-Concept Turtle (POC-Turtle) corpus - a manually created corpus representing a closed semantic space of 16 different words communicated in the simple Turtle language [19]. It contains a total of 12 sentences with 4 words each, totalling 48 words¹. This language is used in semantic web programming and is limited to three words per sentence in a strict Subject-Verb-Predicate triplet grammar. The complexity of such a language can be thought of as closer to the complexity of language that non-human primates can learn [20], or that children up to the age of 3 can use [21].
Example sentences:
tuna isa fish .
parrot has wing .

• Proof-of-Concept English with no ambiguity (POC-English-NoAmb) corpus - a manually created, closed semantic space of 19 English words with nearly the same frequency of use, communicated in simple grammatical constructions of 4-5 words per sentence. It contains a total of 36 sentences and 206 words, none of which is considered ambiguous for the purposes of this thesis.
Example sentences:
A daughter is a child .
Cake is a food now .

• Proof-of-Concept English with ambiguity (POC-English-Amb) corpus - a manually created, closed semantic space of very few English words (43) with nearly the same frequency of use, communicated in simple grammatical constructions of 4-5 words per sentence. It is similar to the previous corpus (it is actually a superset of it) but contains two words involved in semantically and grammatically ambiguous constructions. Syntactic ambiguity is represented by the word 'saw', which can be either a noun or a verb. The word 'board' is used to exemplify semantic ambiguity, with two different noun meanings (surface to write on vs. group of people directing some activity). It contains a total of 88 sentences and 573 words.
Example sentences:
Mom saw dad with a saw .
Dad is on the board of directors .

1 All corpus and vocabulary sizes are given after the corpus has been pre-processed for our purposes (see section 3.1.1). Words, punctuation marks and any other symbols appearing as separate tokens are considered words.

• Child Directed Speech (CDS) corpus - a sub-collection of the CHILDES corpus [32] containing adults' speech directed to children, characterized by adapted lexicon and reduced grammar complexity.
Specifically, it contains the Brent-Ratner corpus [6, 9] and the UCI-Brent-Syl corpus [10]². It contains a total of 4k unique words in 38k sentences, for a total of 130k words.
Example sentences:
Did you wanna sit down ?
And they all rolled over and one fell out

• Gutenberg Children corpus - a compendium of books for children contained within Project Gutenberg³, following the selection used for the Children's Book Test of the Babi CBT corpus [25]⁴. It contains a vocabulary of 40k words in 207k sentences, with a total size of 2.7M words.
Example sentences:
Alice led the way , and the whole party swam to the shore .
" They're not in the past tense , " retorted aunt Jamesina .

• Brown corpus [20] - a 1-million-word compilation of printed English prose from 1961, covering 15 different categories including Press, Fiction and Humor, among others⁵. After pre-processing it (see section 3.1.1), the corpus contains a total of 482k words distributed in 35k sentences, with a vocabulary size of 43k words. This corpus is considered 'natural' English, and is the most advanced one among our tested corpora.
Example sentences:
Only within the framework of a mature relationship characterized by honest appraisals of performance can we provide telling assistance .
The battle of the Naktong River is just one example of how the battle cry and the spirit of The Fighting Seventh have paid off .

• NLTK's Penn Treebank Sample - a sample from the full Penn Treebank corpus [34], which is available through the NLTK suite [7]. It contains parse trees for 10% of the Wall Street Journal articles in the original corpus. The corpus consists of a vocabulary of 6k words and a total of 2,285 sentences, for a total size of 38k words. Together with the Brown corpus, it represents the most advanced English used in this thesis. The purpose of including a treebank is to use it as a Gold Standard during the evaluation of automatically-generated parses (see section 3.4).
Example sentences:
The problem involves the motion of small magnetic fields within superconductor crystals , limiting their current-carrying capacity .
The Underwood family said that holders of more than a majority of the stock of the company have approved the transaction by written consent .

2 Available at https://childes.talkbank.org/derived/
3 https://www.gutenberg.org
4 https://research.fb.com/downloads/babi/
5 http://www.sls.hawaii.edu/bley-vroman/brown_nolines.txt

3.1.1 Pre-cleaner

So far we have described the nature of the corpora's content. Nevertheless, another important aspect to take into account with regards to corpora control is their format. Collections of text like the ones just described normally come with a variety of symbols and content that are not exactly part of the language structure we are interested in learning, e.g. HTML markup and graphics, variations of the same punctuation marks, UTF codes, capitalized letters, etc. In an ideal scenario, our language learner should be able to also understand the function of that content, but at the current stage of the system it would benefit from using more standardized input. In order to ensure that the input is self-consistent and useful to our purposes, we pre-processed all the input corpora with a text-cleaning algorithm⁶.
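As a rough illustration of the kind of normalization such a cleaner performs, the following minimal Python sketch lower-cases a line, standardizes a few symbols, replaces URLs and numbers with encoding tokens, tokenizes punctuation, and drops over-long sentences and words. This is only a simplified sketch under our own naming conventions, not the actual pre-cleaner; the real tool and its options are described next.

import re

def clean_line(line, max_tokens=25, max_word_len=25):
    # Lower-case and standardize a few symbol variants (hypothetical subset)
    line = line.lower().replace('“', '"').replace('”', '"').replace('’', "'")
    # Replace URLs and numbers with encoding tokens
    line = re.sub(r"https?://\S+", "@url@", line)
    line = re.sub(r"\d+", "@number@", line)
    # Tokenize punctuation marks as separate tokens
    line = re.sub(r"([.,!?;:\"'])", r" \1 ", line)
    tokens = line.split()
    # Drop sentences that are too long or that contain overly long words
    if len(tokens) > max_tokens or any(len(t) > max_word_len for t in tokens):
        return None
    return " ".join(tokens)

print(clean_line("Visit https://www.gutenberg.org NOW!"))  # visit @url@ now !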
The pre-cleaner contains several options for pre-processing the input (which can be turned on and off); they can be grouped into the following broad categories:

• Character content of the text: lower-casing the corpus, standardizing different versions of some symbols (e.g. quote symbols), removing non-allowed characters, removing HTML markup, tokenizing special characters.
• Structure of the text: sentence splitting, header removal, maximum sentence and word lengths allowed.
• Encoding of entities: replacing numbers, URLs, dates and times with encoding tokens (e.g. '@url@'), decoding escaped HTML and Unicode symbols.

For the purposes of this thesis, we pre-process the corpora before parsing using the pre-cleaner default values, which imply the following actions:

• Determine sentence boundaries and split sentences into one per line.
• Remove headers (only for the Gutenberg Children corpus).
• Substitute URL and e-mail addresses with special tokens. Same for dates, times and numbers.
• Decode escaped HTML and Unicode symbols to their printable version.
• Standardize single quotes to one symbol. Same for double quotes, dashes, long dashes.
• Tokenize quotes, double quotes and several other punctuation marks.
• Remove asterisks, underscores, long dashes.
• Remove sentences with more than 25 tokens.
• Remove words with more than 25 characters.
• Convert text to lower-case.

6 Pre-cleaner available at: https://github.com/singnet/language-learning/tree/master/src/pre-cleaner

3.2 Parsing

The grounding for our parsing efforts in the ULL project⁷ is the unsupervised, mutual information-based parser originally proposed by Deniz Yuret [55], which was briefly explained in the previous chapter (section 2.1). The parsing pipeline consists of three stages: gathering frequency information of words and word pairs from the training data, calculating the fractional mutual information (FMI) of those word pairs, and parsing the sentences using the estimated FMI and a Maximum Spanning Tree (MST) algorithm. The counting and parsing stages are independent and consequently can be performed using the same or different sets of training sentences. Each of these stages is explained next.

3.2.1 Frequency counting

Mutual information can be estimated by observing the frequency with which words and word pairs appear in the studied corpora (see section 3.2.2). In order to determine these frequencies, the system includes a text-processing pipeline that scans a text document assuming that each new line represents a new sentence; then it tokenizes the text, which in this case means breaking it into words; and finally it goes through the text again word by word, sentence by sentence, saving the encountered word instances and sentences in related atoms as described in section 2.3⁸. Counting individual words in this manner is simple: it is possible to attach to each WordNode a count truth value (a numeric valuation for an atom, as described in 2.3) that is updated each time the text-processor runs into a corresponding WordInstanceNode. Nevertheless, in this project we are not interested in the appearance of words by themselves but in the presence of words in relation to other words. Thus we need a different kind of structure. Recall that the AtomSpace has a special structure to represent link grammar links.
This structure uses a node with the name of the link type to connect two WordInstanceNodes corresponding to the appearance of a word pair. Since we don't have any link types at this point, and we are interested in registering the repeated presence of word pairs (and not their individual appearance), one possibility is to modify this structure to use a generic link type and use WordNodes instead of WordInstanceNodes. The structure then would be as follows:

(EvaluationLink
    (LinkGrammarRelationshipNode "ANY")
    (ListLink
        (WordNode "left-word")
        (WordNode "right-word")
    )
)

Another option is to create a new link atom where the link type is simply named after its word-pair nature, as follows:

(EvaluationLink
    (PredicateNode "*-Sentence Word Pair-*")
    (ListLink
        (WordNode "left-word")
        (WordNode "right-word")
    )
)

In either case, the presence and frequency (N(x, y) and P(x, y); see section 3.2.2) of these word pairs in the text can be attached to these links as truth values. However, to detect these 'presences' we need to process the text in an observation mode that has a way to determine when the respective counts should be updated. In this thesis project we considered three variations or modes for counting word pairs:

• LG-any: In this mode, for each sentence we sample a limited number of parses using the Link Grammar parser in random parsing mode ('any' mode), and then count the pairs of words linked in these parses. Hence, the structure used for keeping these counts is the first of the two mentioned above. The number of different parses per sentence to consider should be given as a parameter; here we fixed it to 100.

• Clique-WIN: In this mode we use a Cartesian combination of per-sentence words that are within a given window size. That is, per sentence, the observer iterates over each word and pairs it with every other word located within a determined maximum distance to its right. Distance in this context is defined as the difference between the words' positions in the sentence; as such, neighboring words have a distance of 1. The window size, or maximum distance between words in a pair, must be given as a parameter to the observer. For our experiments we fixed it to 6, based on linguistic evidence that (most) syntactic information in a sequence of words lies within this distance [28]. The counts for this kind of pair are held as values in the second type of word-pair structure mentioned above.

• Clique-WIN-dist: In this mode we use a Cartesian combination of per-sentence words that are within a given window size, and account for pair distance. That is, we do the same as in the Clique-WIN method, except that now each word pair is counted a number of times determined by the distance (as defined in Clique-WIN) between the words. In the Clique-WIN method, every time we pair up two words we increase their count by one; in this method we increase it by the quotient WIN/d, where WIN is the window size and d is the actual distance between the words, thus favoring words that appear closer together in a sentence. Once again, the window is set to 6, and the structure used to keep the counts is the second of the ones described above. A small sketch of both clique-counting modes is given below.

7 Code hosted at https://github.com/singnet/opencog/tree/master/opencog/nlp/learn
8 Unless we use LG, only one parse is assigned to each sentence and it is a sequential parse, since we don't have any parsing rules available at that point.
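As a minimal illustration of the two clique-counting modes, the following Python sketch accumulates ordered pair counts in a plain dictionary rather than in AtomSpace structures; the function and variable names are ours and are only meant to mirror the behaviour described above.

from collections import defaultdict

def observe_clique(sentence, pair_counts, window=6, weight_by_distance=False):
    # Count ordered word pairs within the given window, as in Clique-WIN.
    # With weight_by_distance=True, each pair adds window/d instead of 1,
    # mimicking the Clique-WIN-dist mode.
    words = sentence.split()
    for i in range(len(words)):
        for j in range(i + 1, min(i + window, len(words) - 1) + 1):
            d = j - i  # distance between the two words in the sentence
            pair_counts[(words[i], words[j])] += window / d if weight_by_distance else 1

counts = defaultdict(float)
observe_clique("tuna isa fish .", counts, weight_by_distance=True)
print(counts[("tuna", "isa")], counts[("tuna", "fish")])  # 6.0 3.0

The LG-any mode would instead obtain the pairs from randomly sampled Link Grammar parses, which we do not reproduce here.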
3.2.2 Mutual information calculation

Lexical attraction was defined in section 2.1 as the MI held in a syntactic relation. However, because the appearance of a specific word in a particular place in a sentence is not the random variable itself but an instance of it, what Yuret [55] really calculated in his parser was the FMI of the word pairs. Furthermore, we are interested in ordered word pairs, not individual words, so in the context of this thesis the random variables used for defining the FMI of an ordered word pair are as follows:

• X is the random variable of the event of having a word x related to any other word at its right in a sentence. Let * be a wild-card, i.e., any word in the vocabulary of the language; then X represents the event of observing the ordered word pair (x, *).
• Y is the random variable of having a word y related to any other word at its left in a sentence. Similarly, Y represents the event of observing the ordered word pair (*, y).

Moreover, if a word x has another word y to its right, then the word y has word x to its left; and p(X = x, Y = y) = p(Y = y, X = x) is the probability of observing the word pair (x, y) in exactly that order. The random variables X and Y do not have disjoint sample spaces, given that words can be syntactically related to words both at their left and at their right in a sentence. Hence, to avoid confusion, we will use the notation P(x, y) for the probability of observing the word pair (x, y) in a language in that same order⁹. Using this notation and the notion of a wild-card explained before, we can rewrite the equation to calculate the FMI of a word pair (x, y) as:

FMI(x, y) = \log_2 \frac{P(x, y)}{P(x, *) \, P(*, y)}

This implies that p(X = x) = P(x, *) and p(Y = y) = P(*, y). Also notice that P(x, y) and P(y, x) represent two different things.

9 In this sense, P(y, x) is unambiguously short for p(X = y, Y = x).

Considering that it is impossible to know the relative occurrence of ordered word pairs in an entire language, we will use the word and word-pair counts collected from observing texts (as described in the previous section) to estimate it. After running the text-processor over a corpus, the AtomSpace will have gathered statistics and attached them to the created sentence structures. In particular, it will contain the number of times each ordered pair (x, y) was observed in the text, which we will denote N(x, y)¹⁰. Using these, we can estimate the value of P(x, y) as:

P(x, y) \approx \frac{N(x, y)}{N(*, *)}

where N(*, *) is the total number of ordered word-pair observations. In turn, this can be computed as:

N(*, *) = \sum_{x, y} N(x, y)

Lastly, we just need to calculate the marginal probabilities.
These too can be estimated using the word-pair statistics gathered:

P(x, *) = \sum_{y} P(x, y) \approx \frac{N(x, *)}{N(*, *)}

P(*, y) = \sum_{x} P(x, y) \approx \frac{N(*, y)}{N(*, *)}

Here N(x, *) is the total number of times the word x was observed paired with any other word to its right in the text; it can be computed as:

N(x, *) = \sum_{y} N(x, y)

Similarly, N(*, y) is the number of times the word y was observed paired with any other word to its left in the text; it can be computed as:

N(*, y) = \sum_{x} N(x, y)

Substituting the above estimates of the wild-card probabilities, we can estimate the fractional mutual information of an ordered word pair (x, y) as:

FMI(x, y) \approx \log_2 \frac{N(x, y) \, N(*, *)}{N(x, *) \, N(*, y)}

Finally, recall that FMI is a ratio that compares the probability of observing the two words together in a specific order vs. observing them separately, and it therefore lies in the range (-∞, ∞).

10 Note that N(x, y) is not the same as N(y, x) either.

3.2.3 MST algorithm

Once we have a set of ordered word pairs together with an estimation of their point-wise mutual information, we can proceed to parse the text making use of these statistics. As we described in the introduction, the OpenCog project uses an algorithm for maximal parsing, but it is neither of the two illustrated by Yuret's work and mentioned in section 2.1. The reason is that the optimal algorithm has too high a complexity to be useful, while we are willing to sacrifice a little efficiency (i.e., accept something a little worse than O(n²)) in order to get more trustworthy results. Thus, an MST (minimum spanning tree) algorithm was developed within the team, which maximizes the total score of the tree¹¹ instead of minimizing it, and restricts the result to a projective tree as well¹².

When it comes to projective maximal parsing algorithms, the state of the art is Eisner's algorithm, which runs in O(n³) time [35]. Nonetheless, it uses dynamic programming and is not the most straightforward to implement within OpenCog. The simplest implementations of MST algorithms are the three classical ones: Prim's, Kruskal's, and Borůvka's [3, 43]. On the one hand, modifying Prim's algorithm to restrict crossing edges would be simple, but this modification would make the outcome of the algorithm heavily dependent on the initialization, which did not seem very promising¹³. Kruskal's, on the other hand, seems too difficult to combine with a no-link-crossing constraint. Hence, the algorithm we use is an implementation of a variant of Borůvka's algorithm, which seems to be the most robust and sufficiently fast alternative for the project's needs.

Overall, this algorithm takes a sequence of atoms representing an ordered list of words and finds an unlabelled, undirected, projective dependency parse for it. This means that we need a parser method that feeds the algorithm with such a sequence. Therefore, we need to process the input text once more, tokenize each sentence into a list of words, and create a sequence of atoms per sentence from the word list (the atoms used here are of the WordNode type). The parser method then takes each sequence and creates a set of pairs (number . atom), where number corresponds to the position of the respective word in the sentence. Finally, it inputs this set as the set V of vertices of a clique graph G = (V, E)¹⁴, together with the respective set of edges E and a predefined edge scoring function f : E → R, to the modified MST algorithm, which finds a projective tree T in G with maximal total score F = \sum_{e \in T} f(e).
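To make these estimates and scoring functions concrete, the following minimal Python sketch computes FMI values from a dictionary of ordered pair counts (such as the one built in the counting sketch of section 3.2.1) and wraps them into an edge scoring function. This is only an illustration outside the AtomSpace; the function names, the -1e6 penalty and the add_distance_term flag are our own simplifications of the FMI and FMI-dist functions described further below.

import math
from collections import defaultdict

def fmi_table(pair_counts):
    # Estimate FMI(x, y) for every observed ordered pair from raw counts.
    total = sum(pair_counts.values())          # N(*, *)
    left = defaultdict(float)                  # N(x, *)
    right = defaultdict(float)                 # N(*, y)
    for (x, y), n in pair_counts.items():
        left[x] += n
        right[y] += n
    return {(x, y): math.log2(n * total / (left[x] * right[y]))
            for (x, y), n in pair_counts.items()}

def edge_score(fmi, words, i, j, add_distance_term=False, max_dist=16):
    # Score an edge between word positions i < j, mimicking FMI / FMI-dist.
    d = j - i
    pair = (words[i], words[j])
    if d > max_dist or pair not in fmi:
        return -1e6                            # the 'very negative value'
    return fmi[pair] + (1.0 / d if add_distance_term else 0.0)

Scores of this kind are what the modified MST algorithm below maximizes when selecting a projective parse tree.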
Pseudo-code for this modified MST algorithm is given below as Algorithm 1. The algorithm returns the tree as a list of atom pairs, each with its associated score.

11 The total score is the sum of the weights of the edges of the tree, as defined in any classical MST algorithm [43].
12 A projective tree is one such that, if we number the vertices and arrange them in order in a line, the edges of the tree can be drawn above the vertices without crossings [35].
13 The success of the original Prim algorithm lies in the fact that, if there is an optimal solution, the algorithm will find it regardless of which vertex is chosen first. That fact no longer holds if we add the projectiveness restriction.
14 A clique is a graph where there exists an edge between every possible pair of vertices [43].

Algorithm 1: Pseudo-algorithm for the modified MST parser
Input: A clique graph G = (V, E), where V is a linearly ordered set, and an edge scoring function f.
Output: The set L of edges representing the links in the parse.

Take the edge c = (a, b) from E that has the highest score f(c);
Initialize the set L of links assigned to the parse to contain only edge c;
Initialize the set U of disconnected vertices to contain V - {a, b}, that is, all the vertices of the graph except the ones adjacent to edge c;
Initialize the set C of connected vertices to be {a, b}, that is, a set with just the vertices incident to edge c;
while U is not empty do
    Let O be the set of all edges with one endpoint in U and the other in C;
    Initialize the candidate edge v to be nothing;
    while O is not empty AND v is nothing do
        Take the edge x from O that has the highest score f(x);
        if x does not cross any links in L then
            Let v be x;
        else
            Remove x from O;
        end
    end
    if v is nothing then
        Exit the loop;
    else
        Let v = (u, w), where u is in U and w is in C;
        Add v to L;
        Add u to C;
        Remove u from U;
    end
end

The scoring function we input to the algorithm is very important, because this numeric score is the one being maximized during the parse. It should be a function that receives as input an edge e, such that the distance d(e) between its two endpoints is known, and outputs a real number. In the case of our structure, the distance is simple to calculate, given that our vertices are pairs (number . atom). Let v = ((x, a), (y, b)) be an edge that connects word-atom a, which is in position x in the sentence, with a different word-atom b, which is in position y ≠ x in the sentence; then the distance is d(v) = |x - y| > 0. A very simple scoring function could just return that distance, but we are interested in a function that favors links with meaningful relations. Thus, in this thesis we experimented with two scoring functions that use both the distance and the FMI of word pairs:

• FMI: since we are interested in links with high lexical attraction, we use the point-wise mutual information between the two words the edge is trying to connect. We also don't want links that are too long, so we give a scoring penalty to links that have d > 16. Thus, this scoring function returns the FMI for every word pair where it is defined, and a very negative value when it is not defined or when the distance between the words is greater than 16.
• FMI-dist: this scoring function does the same as the one above, except that we account for distance a little further. We don't only want links between words that are no further apart than 16 places from each other, but we also want to favor links among words that are closer together in general.
Thus, we add to the point-wise mutual information of the pair an additional value that is inversely proportional to the distance between the paired words. Specifically, when the FMI is defined and the words are within a distance of 16, the scoring function outputs FMI + 1/d, and a very negative value otherwise.

Finally, although the concept is very easy to understand, it is important to define what it means for an edge to cross any links. Recall that V is linearly ordered, so a relation < among the vertices is defined. Let the edge be e = (j, k) with j < k. For any l ∈ L with endpoints x and y such that x < y, we say that e crosses l if one of the following two conditions holds:

• j < x AND x < k AND k < y
• x < j AND j < y AND y < k

3.3 Word-sense Disambiguation

After the input sentences are organized in parse structures, the ULL pipeline continues by trying