Analysis and Generation of Wikidata Descriptions
Type
Examensarbete för masterexamen
Master's Thesis
Abstract
This thesis explores the structure and generation of descriptive texts for Wikidata entities,
focusing on cities, universities, and mathematicians. The goal is to develop a
grammar-based, language-independent system for automatic description generation.
We begin by analyzing multilingual description patterns across six languages, revealing
high structural consistency within languages and substantial cross-language
variation, particularly between European and non-European language groups. A
detailed property frequency analysis shows that a small number of attributes account
for the majority of descriptions. A further label-occurrence analysis indicates
that, while human-readable administrative attributes are well represented, identifiers
and spatial data are rarely included in natural-language descriptions. To mitigate
missing-label issues, we design a data augmentation pipeline using GeoNames
and OpenStreetMap, significantly improving label coverage across languages. We
also compare the grammar-based approach with a Retrieval-Augmented Generation
(RAG) system and find that the former performs significantly better in terms of
clarity, structural consistency, and multilingual alignment. Our findings inform the
design of a multilingual description generation system based on Grammatical Framework
(GF), emphasizing clarity, informativeness, and structural consistency. This
project is part of a broader collaboration: Bokun Xiao contributed to the development
of the core grammar, Imtiaz Ayon focused on building the Bengali grammar,
and another team was responsible for the Greek grammar.
Keywords
Wikidata, Grammatical Framework, multilingual description generation
