Analysis and Generation of Wikidata Descriptions

dc.contributor.author: Xu, Hutai
dc.contributor.author: Cao, Yuxiang
dc.contributor.department: Chalmers tekniska högskola / Institutionen för data och informationsteknik (sv)
dc.contributor.department: Chalmers University of Technology / Department of Computer Science and Engineering (en)
dc.contributor.examiner: Ljunglöf, Peter
dc.contributor.supervisor: Ranta, Aarne
dc.date.accessioned: 2026-02-05T09:03:41Z
dc.date.issued: 2025
dc.date.submitted:
dc.description.abstract: This thesis explores the structure and generation of descriptive texts for Wikidata entities, focusing on cities, universities, and mathematicians. The goal is to develop a grammar-based, language-independent system for automatic description generation. We begin by analyzing multilingual description patterns across six languages, revealing high structural consistency within languages and substantial cross-language variation, particularly between European and non-European language groups. A detailed property frequency analysis shows that a small number of attributes account for the majority of descriptions. Further label occurrence analysis indicates that while human-readable administrative attributes are well represented, identifiers and spatial data are rarely included in natural language descriptions. To mitigate missing label issues, we design a data augmentation pipeline using GeoNames and OpenStreetMap, significantly improving label coverage across languages. We also compare the grammar-based approach with a Retrieval-Augmented Generation (RAG) system and find that the former performs significantly better in terms of clarity, structural consistency, and multilingual alignment. Our findings inform the design of a multilingual description generation system based on Grammatical Framework (GF), emphasizing clarity, informativeness, and structural consistency. This project is part of a broader collaboration: Bokun Xiao contributed to the development of the core grammar, Imtiaz Ayon focused on building the Bengali grammar, and another team was responsible for the Greek grammar.
dc.identifier.coursecode: DATX05
dc.identifier.uri: http://hdl.handle.net/20.500.12380/310961
dc.language.iso: eng
dc.setspec.uppsok: Technology
dc.subject: Wikidata
dc.subject: grammatical framework
dc.subject: multilingual description generation
dc.title: Analysis and Generation of Wikidata Descriptions
dc.type.degree: Examensarbete för masterexamen (sv)
dc.type.degree: Master's Thesis (en)
dc.type.uppsok: H
local.programme: Computer science – algorithms, languages and logic (MPALG), MSc

Download

Original bundle

Name: CSE 25-173 HX YC.pdf
Size: 3.21 MB
Format: Adobe Portable Document Format
