Analysis and Generation of Wikidata Descriptions

Type

Master's Thesis

Abstract

This thesis explores the structure and generation of descriptive texts for Wikidata entities, focusing on cities, universities, and mathematicians. The goal is to develop a grammar-based, language-independent system for automatic description generation. We begin by analyzing multilingual description patterns across six languages; the analysis reveals high structural consistency within each language and substantial variation across languages, particularly between European and non-European language groups. A detailed property-frequency analysis shows that a small number of attributes account for the majority of descriptions. A further label-occurrence analysis indicates that human-readable administrative attributes are well represented, whereas identifiers and spatial data rarely appear in natural-language descriptions. To mitigate missing labels, we design a data augmentation pipeline using GeoNames and OpenStreetMap, which significantly improves label coverage across languages. We also compare the grammar-based approach with a Retrieval-Augmented Generation (RAG) system and find that the grammar-based approach performs significantly better in terms of clarity, structural consistency, and multilingual alignment. Our findings inform the design of a multilingual description generation system based on Grammatical Framework (GF), emphasizing clarity, informativeness, and structural consistency. This project is part of a broader collaboration: Bokun Xiao contributed to the development of the core grammar, Imtiaz Ayon focused on building the Bengali grammar, and another team was responsible for the Greek grammar.

Keywords

Wikidata, Grammatical Framework, multilingual description generation
