Statistical parsing and unambiguous word representation in OpenCog’s Unsupervised Language Learning project

Castillo-Domenech, Claudia; Suarez-Madrigal, Andres

Download

Primary file 256408.pdf (1.51 MB)

Published

2018

Authors

Castillo-Domenech, Claudia

Suarez-Madrigal, Andres

Type

Master's thesis

Programme

Computer science – algorithms, languages and logic (MPALG), MSc

Abstract

The work presented in this thesis is part of the larger project “Unsupervised Language Learning”, which aims to build a system that can learn the grammar of a language by processing a corpus of unannotated text. In particular, the authors focused on the first steps of the pipeline: the selection and pre-processing of useful text collections to build the system and test its performance, and the syntactic learning loop, in which the system obtains statistics from the sentences of a given corpus and leverages them to implicitly learn the syntactic structure of the corpus language. The process gathers statistics from the co-occurrence of words in a sentence using different counting methods and estimates the mutual information of each word pair. Using this knowledge and a minimum spanning tree algorithm, the system automatically produces syntactic parses of the ingested sentences. The unsupervised parses, when compared to human-generated or rule-based standards, vary in quality; however, they always perform better than a baseline of randomly generated parses, implying that the system is indeed learning some of the assumed underlying linguistic content from the texts. The results show that the complexity of the input text is a determining factor in which method performs best, leading us to conclude that a successful unsupervised parser should be able, to some extent, to pre-assess this complexity before processing. In addition, the outputs of all our parser methods show that accounting for the distance between the words of a pair when parsing yields better results. Nonetheless, evaluating this finding with more confidence requires a standard for comparison that is based on the same model assumptions.
Additionally, we implemented a disambiguation process based on AdaGram to build distinct representations for the different word senses within a corpus; the process then annotates the corpus with tags representing the different uses of each word. The purpose of this pre-processing step is to resolve polysemy in the corpora and provide a cleaner input to the parsing pipeline. Our experiments show either a slight improvement in parse quality or no significant change when disambiguated corpora are used as input.
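
The parsing loop summarized in the abstract (count word-pair co-occurrences, estimate their mutual information, link the words of a sentence with a spanning tree) can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names, the optional `window` parameter (a stand-in for one possible counting method), and the `unseen` score for pairs never observed are all assumptions made for illustration. The sketch also assumes that link costs are the negated mutual information, so that a minimum spanning tree keeps the highest-scoring links.

```python
import math
from collections import Counter
from itertools import combinations


def count_pairs(sentences, window=None):
    """Count ordered (left, right) word pairs co-occurring in a sentence.

    `window` (hypothetical parameter) limits pairs to words at most that
    many positions apart; None counts every pair in the sentence.
    """
    pairs = Counter()
    for words in sentences:
        for i, j in combinations(range(len(words)), 2):
            if window is None or j - i <= window:
                pairs[(words[i], words[j])] += 1
    return pairs


def pairwise_mi(pairs):
    """Pointwise mutual information (in bits) for every observed pair."""
    total = sum(pairs.values())
    left, right = Counter(), Counter()
    for (a, b), n in pairs.items():
        left[a] += n   # how often `a` appears as the left word of a pair
        right[b] += n  # how often `b` appears as the right word of a pair
    return {
        (a, b): math.log2(n * total / (left[a] * right[b]))
        for (a, b), n in pairs.items()
    }


def spanning_tree_parse(words, mi, unseen=-1e9):
    """Link the words of one sentence with Prim's algorithm.

    Costs are -MI (an assumption of this sketch), so the minimum
    spanning tree keeps the links with the highest mutual information.
    Returns a sorted list of (left_index, right_index) links.
    """
    if len(words) < 2:
        return []
    in_tree = {0}
    links = []
    while len(in_tree) < len(words):
        best = None
        for i in in_tree:
            for j in range(len(words)):
                if j in in_tree:
                    continue
                lo, hi = min(i, j), max(i, j)
                cost = -mi.get((words[lo], words[hi]), unseen)
                if best is None or cost < best[0]:
                    best = (cost, lo, hi)
        links.append((best[1], best[2]))
        # Exactly one endpoint of the chosen link is new to the tree.
        in_tree.add(best[1] if best[1] not in in_tree else best[2])
    return sorted(links)
```

A possible usage, on a toy corpus chosen only for illustration, is `mi = pairwise_mi(count_pairs(corpus))` followed by `spanning_tree_parse(sentence, mi)` for each tokenized sentence. The sketch does not weight links by word distance and does not forbid crossing links; both would be extensions beyond what is shown here.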

Subject/keywords

Computer and Information Science

URI

https://hdl.handle.net/20.500.12380/256408

Collections

Master's theses
