Zooming into Comics: Region-Aware RL Improves Fine-Grained Comic Understanding in Vision-Language Models

dc.contributor.author: Chen, Yule
dc.contributor.department: Chalmers tekniska högskola / Institutionen för elektroteknik [sv]
dc.contributor.examiner: Hammarstrand, Lars
dc.contributor.supervisor: Süsstrunk, Sabine
dc.contributor.supervisor: Ren, Yufan
dc.date.accessioned: 2025-10-02T14:04:57Z
dc.date.issued: 2025
dc.date.submitted:
dc.description.abstract: Complex visual narratives, such as comics, present a significant challenge to Vision-Language Models (VLMs). Despite excelling on natural images, VLMs often struggle with stylized line art, onomatopoeia, and densely packed multi-panel layouts. To address this gap, we introduce AI4VA-FG, the first fine-grained and comprehensive benchmark for VLM-based comic understanding. It spans tasks from foundational recognition and detection to high-level character reasoning and narrative construction, supported by dense annotations for characters, poses, and depth. We also evaluate state-of-the-art proprietary models, including GPT-4o and Gemini-2.5, and open-source models such as Qwen2.5-VL. The results reveal substantial performance deficits across the core tasks of our benchmark and underscore that comic understanding remains unsolved. To enhance VLMs’ capabilities in this domain, we systematically investigate post-training strategies, including supervised fine-tuning on solutions (SFT-S), supervised fine-tuning on reasoning trajectories (SFT-R), and reinforcement learning (RL). In addition, inspired by the emerging “Thinking with Images” paradigm, we propose Region-Aware Reinforcement Learning (RARL) for VLMs, which trains models to dynamically attend to relevant regions through zoom-in operations. When applied to the Qwen2.5-VL model, RL and RARL yield significant gains in low-level entity recognition and high-level storyline ordering, paving the way for more accurate and efficient VLM applications in the comics domain.
dc.identifier.coursecode: EENX30
dc.identifier.uri: http://hdl.handle.net/20.500.12380/310572
dc.language.iso: eng
dc.setspec.uppsok: Technology
dc.subject: comics
dc.subject: machine learning
dc.subject: deep learning
dc.subject: large language models
dc.subject: multimodality
dc.subject: post-training
dc.subject: agentic reinforcement learning
dc.title: Zooming into Comics: Region-Aware RL Improves Fine-Grained Comic Understanding in Vision-Language Models
dc.type.degree: Examensarbete för masterexamen [sv]
dc.type.degree: Master's Thesis [en]
dc.type.uppsok: H
local.programme: Computer systems and networks (MPCSN), MSc

Original bundle

Name: Yule_thesis_report_0926[44].pdf
Size: 6.13 MB
Format: Adobe Portable Document Format