Adaptive and Generalizable Vision-Language Models

dc.contributor.author: Li, Zhixing
dc.contributor.department (sv): Chalmers tekniska högskola / Institutionen för data och informationsteknik
dc.contributor.department (en): Chalmers University of Technology / Department of Computer Science and Engineering
dc.contributor.examiner: Tatar, Kivanc
dc.contributor.supervisor: Yu, Yinan
dc.date.accessioned: 2025-10-16T12:19:25Z
dc.date.issued: 2025
dc.date.submitted:
dc.description.abstract: Domain generalization remains a significant challenge for vision-language models, as they are required to perform reliably on previously unseen domains during inference. In this work, we introduce a domain prompt fusion framework aimed at improving the generalization capability of CLIP-based models under domain shift. Our approach integrates three core components: a dual-part soft prompt (comprising domain-agnostic and domain-specific prompts), a domain feature extractor, and a prompt fusion mechanism. The extractor generates domain representations from input images and computes source-domain prototypes, which guide the fusion of prompt-based text features. By weighting and combining domain-aware text features according to their similarity to the input image's domain representation, the model achieves improved alignment between the visual and textual modalities. We evaluate the proposed method on two widely used benchmarks: Office-Home and mini-DomainNet. The results demonstrate consistent performance gains over standard zero-shot CLIP and CoOp. Specifically, our method achieves average accuracies of 84.98% and 85.53% on Office-Home and mini-DomainNet, respectively. Extensive ablation studies and visualizations further validate the effectiveness of our design. While a small performance gap remains compared to the current state-of-the-art method DDSPL, our analysis identifies key areas for future enhancement, including prompt design refinement, class-dependent fusion strategies, and the use of latent domains in place of manual annotations.
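The prompt fusion step described in the abstract — weighting domain-aware text features by the similarity between an input image's domain representation and the source-domain prototypes — can be sketched roughly as follows. This is a minimal illustration, not the thesis's actual implementation: the function name, the softmax temperature `tau`, and the tensor layout are all assumptions.

```python
import torch
import torch.nn.functional as F

def fuse_domain_text_features(img_domain_feat, prototypes, text_feats, tau=0.07):
    """Hypothetical sketch of the prompt-fusion step.

    img_domain_feat: (d,)     domain representation of the input image
    prototypes:      (K, d)   one prototype per source domain
    text_feats:      (K, C, d) prompt-based text features per domain and class
    """
    # Similarity between the image's domain representation and each prototype
    sims = F.cosine_similarity(img_domain_feat.unsqueeze(0), prototypes, dim=-1)  # (K,)
    # Turn similarities into fusion weights over the K source domains
    weights = F.softmax(sims / tau, dim=0)                                        # (K,)
    # Weighted combination of the domain-aware text features
    fused = torch.einsum("k,kcd->cd", weights, text_feats)                        # (C, d)
    # Re-normalize so the fused features can be compared to image features by cosine
    return F.normalize(fused, dim=-1)
```

The fused per-class text features would then be matched against the CLIP image feature of the input (e.g. by cosine similarity) to produce class scores; images whose domain representation lies close to one source prototype draw their text features mostly from that domain's prompts.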
dc.identifier.coursecode: DATX05
dc.identifier.uri: http://hdl.handle.net/20.500.12380/310642
dc.language.iso: eng
dc.relation.ispartofseries: CSE 25-22
dc.setspec.uppsok: Technology
dc.subject: Vision-language model, prompt learning, domain generalization, prompt ensembling.
dc.title: Adaptive and Generalizable Vision-Language Models
dc.type.degree (sv): Examensarbete för masterexamen
dc.type.degree (en): Master's Thesis
dc.type.uppsok: H
local.programme: Data science and AI (MPDSC), MSc

Download

Original bundle (showing 1 - 1 of 1)

Name: CSE 25-22 ZL.pdf
Size: 8.97 MB
Format: Adobe Portable Document Format

License bundle (showing 1 - 1 of 1)

Name: license.txt
Size: 2.35 KB
Format: Item-specific license agreed upon to submission
Description: