Statistical parsing and unambiguous word representation in OpenCog’s Unsupervised Language Learning project

dc.contributor.authorCastillo-Domenech, Claudia
dc.contributor.authorSuarez-Madrigal, Andres
dc.contributor.departmentChalmers tekniska högskola / Institutionen för data- och informationsteknik (Chalmers) [sv]
dc.contributor.departmentChalmers University of Technology / Department of Computer Science and Engineering (Chalmers) [en]
dc.date.accessioned2019-07-03T14:58:31Z
dc.date.available2019-07-03T14:58:31Z
dc.date.issued2018
dc.description.abstractThe work presented in this thesis is an effort within the larger project entitled “Unsupervised Language Learning”, which aims to build a system that can learn the grammar of a language by processing a corpus of unannotated text. In particular, the authors focused on the first steps of the pipeline: the selection and pre-processing of useful text collections for building and testing the system, and the syntactic learning loop in which the system gathers statistics from sentences in a given corpus and leverages them to implicitly learn the syntactic structure of the corpus language. The process collects co-occurrence statistics for words within a sentence using different counting methods and estimates the pairwise mutual information between word pairs. Using this knowledge and a minimum spanning tree algorithm, the system automatically produces syntactic parses of the ingested sentences. The unsupervised parses, when compared to human-generated or rule-based standards, show varying quality; however, they always outperform a baseline of randomly generated parses, implying that the system is indeed learning some of the assumed underlying linguistic content from the texts. From the results we learned that the complexity of the input text is a determining factor in which method performs best, leading us to conclude that a successful unsupervised parser should be able, to some extent, to pre-assess this complexity before processing. Also, the outputs of our different parser methods all show that accounting for the distance between word pairs when parsing yields better results. Nonetheless, a more confident evaluation of this implication requires a standard for comparison that is based on the same model assumptions.
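The parsing loop described in the abstract can be sketched as follows. This is an illustrative toy, not the project's actual OpenCog implementation: the counts are hypothetical, the spanning tree is built with a simple Prim-style greedy step, and all function names (`pmi`, `mst_parse`) are made up for this example.

```python
import math

# Toy corpus statistics (hypothetical numbers standing in for real counts).
pair_counts = {("the", "cat"): 10, ("cat", "sat"): 8, ("the", "sat"): 1,
               ("sat", "mat"): 7, ("cat", "mat"): 1, ("the", "mat"): 9}
word_counts = {"the": 20, "cat": 12, "sat": 9, "mat": 10}
total_pairs = sum(pair_counts.values())
total_words = sum(word_counts.values())

def pmi(a, b):
    """Pointwise mutual information of an unordered word pair:
    log( p(a,b) / (p(a) * p(b)) )."""
    key = (a, b) if (a, b) in pair_counts else (b, a)
    p_ab = pair_counts.get(key, 0) / total_pairs
    if p_ab == 0:
        return float("-inf")       # never co-occurred
    p_a = word_counts[a] / total_words
    p_b = word_counts[b] / total_words
    return math.log(p_ab / (p_a * p_b))

def mst_parse(sentence):
    """Maximum spanning tree over PMI edge weights (greedy Prim-style).

    Returns a set of undirected links (i, j) between word positions;
    the tree's n-1 links are the unlabeled parse of the sentence."""
    connected = {0}
    links = set()
    while len(connected) < len(sentence):
        best = None
        for i in connected:
            for j in range(len(sentence)):
                if j in connected:
                    continue
                w = pmi(sentence[i], sentence[j])
                if best is None or w > best[0]:
                    best = (w, i, j)
        _, i, j = best
        links.add((min(i, j), max(i, j)))
        connected.add(j)
    return links

print(sorted(mst_parse(["the", "cat", "sat", "mat"])))
# → [(0, 3), (1, 2), (2, 3)]
```

A distance-aware variant, as the abstract suggests, would discount the edge weight by the surface distance |i - j| before choosing links.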
Additionally, we implemented a disambiguation process based on AdaGram to build distinct representations for different word senses within a corpus, which then annotates the corpus with tags representing the different uses of each word. The purpose of this pre-processing step is to break polysemy in the corpora and provide a cleaner input to the parsing pipeline. Our experiments show either a slight improvement in parse quality or no significant change when disambiguated corpora are used as input.
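The annotation step can be illustrated with a minimal sketch. This is not the AdaGram model itself: the sense inventory, the vectors, and the `word@k` tag format below are hand-written assumptions standing in for output a trained model would provide.

```python
# Hypothetical sense vectors and context vectors; a real run would take
# these from a trained AdaGram model, not hand-written numbers.
senses = {
    "bank": {0: (1.0, 0.0), 1: (0.0, 1.0)},  # e.g. finance vs. river sense
}
context_vecs = {
    "money": (0.9, 0.1), "deposit": (0.8, 0.2),
    "river": (0.1, 0.9), "shore": (0.2, 0.8),
}

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def annotate(tokens):
    """Tag each polysemous token with its most likely sense id as 'word@k'."""
    out = []
    for i, tok in enumerate(tokens):
        if tok not in senses:
            out.append(tok)          # monosemous words pass through untagged
            continue
        # Average the vectors of known context words in a +/-2 window.
        window = [context_vecs[t] for t in tokens[max(0, i - 2):i + 3]
                  if t in context_vecs]
        if not window:               # no usable context: default to sense 0
            out.append(f"{tok}@0")
            continue
        ctx = tuple(sum(c) / len(window) for c in zip(*window))
        best = max(senses[tok], key=lambda k: dot(senses[tok][k], ctx))
        out.append(f"{tok}@{best}")
    return out

print(annotate(["deposit", "money", "bank"]))
# → ['deposit', 'money', 'bank@0']
```

Tagging each occurrence this way splits one polysemous vocabulary entry into several unambiguous ones, so the downstream statistics are gathered per sense rather than per surface form.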
dc.identifier.urihttps://hdl.handle.net/20.500.12380/256408
dc.language.isoeng
dc.setspec.uppsokTechnology
dc.subjectData- och informationsvetenskap
dc.subjectComputer and Information Science
dc.titleStatistical parsing and unambiguous word representation in OpenCog’s Unsupervised Language Learning project
dc.type.degreeExamensarbete för masterexamen [sv]
dc.type.degreeMaster Thesis [en]
dc.type.uppsokH
local.programmeComputer science – algorithms, languages and logic (MPALG), MSc
Original bundle
Name: 256408.pdf
Size: 1.51 MB
Format: Adobe Portable Document Format
Description: Fulltext