Analysis and Generation of Wikidata Descriptions
Type
Master's Thesis
Abstract
This thesis explores the structure and generation of descriptive texts for Wikidata entities,
focusing on cities, universities, and mathematicians. The goal is to develop a
grammar-based, language-independent system for automatic description generation.
We begin by analyzing multilingual description patterns across six languages, revealing
high structural consistency within languages and substantial cross-language
variation, particularly between European and non-European language groups. A
detailed property frequency analysis shows that a small number of attributes account
for the majority of descriptions. A further label-occurrence analysis indicates
that, while human-readable administrative attributes are well represented, identifiers
and spatial data are rarely included in natural-language descriptions. To mitigate
missing labels, we design a data augmentation pipeline using GeoNames
and OpenStreetMap, significantly improving label coverage across languages. We
also compare the grammar-based approach with a Retrieval-Augmented Generation
(RAG) system and find that the former performs significantly better in terms of
clarity, structural consistency, and multilingual alignment. Our findings inform the
design of a multilingual description generation system based on Grammatical Framework
(GF), emphasizing clarity, informativeness, and structural consistency. This
project is part of a broader collaboration: Bokun Xiao contributed to the development
of the core grammar, Imtiaz Ayon focused on building the Bengali grammar,
and another team was responsible for the Greek grammar.
Subject/keywords
Wikidata, Grammatical Framework, multilingual description generation
