Analysis and Generation of Wikidata Descriptions
| dc.contributor.author | Xu, Hutai | |
| dc.contributor.author | Cao, Yuxiang | |
| dc.contributor.department | Chalmers tekniska högskola / Institutionen för data och informationsteknik | sv |
| dc.contributor.department | Chalmers University of Technology / Department of Computer Science and Engineering | en |
| dc.contributor.examiner | Ljunglöf, Peter | |
| dc.contributor.supervisor | Ranta, Aarne | |
| dc.date.accessioned | 2026-02-05T09:03:41Z | |
| dc.date.issued | 2025 | |
| dc.date.submitted | ||
| dc.description.abstract | This thesis explores the structure and generation of descriptive texts for Wikidata entities, focusing on cities, universities, and mathematicians. The goal is to develop a grammar-based, language-independent system for automatic description generation. We begin by analyzing multilingual description patterns across six languages, which reveals high structural consistency within each language and substantial cross-language variation, particularly between European and non-European language groups. A detailed property-frequency analysis shows that a small number of attributes account for the majority of descriptions. A further label-occurrence analysis indicates that while human-readable administrative attributes are well represented, identifiers and spatial data are rarely included in natural-language descriptions. To mitigate missing-label issues, we design a data augmentation pipeline using GeoNames and OpenStreetMap, which significantly improves label coverage across languages. We also compare the grammar-based approach with a Retrieval-Augmented Generation (RAG) system and find that the former performs significantly better in terms of clarity, structural consistency, and multilingual alignment. Our findings inform the design of a multilingual description generation system based on Grammatical Framework (GF), emphasizing clarity, informativeness, and structural consistency. This project is part of a broader collaboration: Bokun Xiao contributed to the development of the core grammar, Imtiaz Ayon focused on building the Bengali grammar, and another team was responsible for the Greek grammar. | |
| dc.identifier.coursecode | DATX05 | |
| dc.identifier.uri | http://hdl.handle.net/20.500.12380/310961 | |
| dc.language.iso | eng | |
| dc.setspec.uppsok | Technology | |
| dc.subject | Wikidata | |
| dc.subject | Grammatical Framework | |
| dc.subject | multilingual description generation | |
| dc.title | Analysis and Generation of Wikidata Descriptions | |
| dc.type.degree | Examensarbete för masterexamen | sv |
| dc.type.degree | Master's Thesis | en |
| dc.type.uppsok | H | |
| local.programme | Computer science – algorithms, languages and logic (MPALG), MSc | |
