Zooming into Comics: Region-Aware RL Improves Fine-Grained Comic Understanding in Vision-Language Models
Published
Author
Type
Master's Thesis
Model builder
Journal title
ISSN
Volume title
Publisher
Abstract
Complex visual narratives, such as comics, present a significant challenge to Vision-
Language Models (VLMs). Despite excelling on natural images, VLMs often struggle
with stylized line art, onomatopoeia, and densely packed multi-panel layouts.
To address this gap, we introduce AI4VA-FG, the first fine-grained and comprehensive
benchmark for VLM-based comic understanding. It spans tasks from
foundational recognition and detection to high-level character reasoning and narrative
construction, supported by dense annotations for characters, poses, and depth.
We then evaluate state-of-the-art proprietary models, including GPT-4o
and Gemini-2.5, and open-source models such as Qwen2.5-VL, revealing substantial
performance deficits across the core tasks of our benchmark and underscoring that
comic understanding remains unsolved. To enhance VLMs’ capabilities in this domain,
we systematically investigate post-training strategies, including supervised
fine-tuning on solutions (SFT-S), supervised fine-tuning on reasoning trajectories
(SFT-R), and reinforcement learning (RL). In addition, inspired by the emerging
“Thinking with Images” paradigm, we propose Region-Aware Reinforcement
Learning (RARL) for VLMs, which trains models to dynamically attend to relevant
regions through zoom-in operations. We observe that when applied to the
Qwen2.5-VL model, RL and RARL yield significant gains in low-level entity recognition
and high-level storyline ordering, paving the way for more accurate and efficient
VLM applications in the comics domain.
Description
Subject/keywords
comics, machine learning, deep learning, large language models, multimodality, post-training, agentic reinforcement learning
