Adaptive and Generalizable Vision-Language Models
Master’s thesis in Computer science and engineering
Zhixing Li
Department of Computer Science and Engineering
CHALMERS UNIVERSITY OF TECHNOLOGY
UNIVERSITY OF GOTHENBURG
Gothenburg, Sweden 2025

© Zhixing Li, 2025.
Supervisor: Yinan Yu, Department of Computer Science and Engineering
Co-supervisor: Arsham Gholamzadeh Khoee, Department of Computer Science and Engineering
Examiner: Kivanc Tatar, Department of Computer Science and Engineering

Master’s Thesis 2025
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000
Typeset in LaTeX
Gothenburg, Sweden 2025

Abstract

Domain generalization remains a significant challenge for vision-language models, as they are required to perform reliably on previously unseen domains during inference. In this work, we introduce a domain prompt fusion framework aimed at improving the generalization capability of CLIP-based models under domain shift. Our approach integrates three core components: a dual-part soft prompt (comprising domain-agnostic and domain-specific prompts), a domain feature extractor, and a prompt fusion mechanism. The extractor generates domain representations from input images and computes source-domain prototypes, which guide the fusion of prompt-based text features.
By weighting and combining domain-aware text features according to their similarity to the input image’s domain representation, the model achieves improved alignment between visual and textual modalities. We evaluate the proposed method on two widely used benchmarks: Office-Home and mini-DomainNet. The results demonstrate consistent performance gains over standard zero-shot CLIP and CoOp. Specifically, our method achieves average accuracies of 84.98% and 85.53% on Office-Home and mini-DomainNet, respectively. Extensive ablation studies and visualizations further validate the effectiveness of our design. While a small performance gap remains compared to the current state-of-the-art method DDSPL, our analysis identifies key areas for future enhancement, including prompt design refinement, class-dependent fusion strategies, and the use of latent domains in place of manual annotations.

Keywords: Vision-language model, prompt learning, domain generalization, prompts ensembling.

Acknowledgements

First and foremost, I would like to express my sincere gratitude to my supervisor Yinan Yu, as well as Arsham Gholamzadeh Khoee, for their thoughtful and meticulous guidance throughout the course of my thesis work. They helped me define the direction of my research, pointed out the weaknesses in my proposed methods, assisted with the experimental design, and provided invaluable feedback on my writing. I was deeply impressed by their expertise and professionalism, and I will continue to learn from them as role models in my academic development.

Secondly, I would like to thank my examiner, Kivanc Tatar, who offered many insightful comments on both my planning report and halftime report. His feedback helped me improve the structure and academic rigor of my thesis. I am also grateful to my opponents, Filip Landin and Luca Modica, who provided valuable suggestions from a fellow student’s perspective.
Their input helped me refine the details of my thesis and made it more understandable and coherent.

Finally, I would like to thank the C3SE division at Chalmers University for providing the computational resources that supported this work. We acknowledge the National Academic Infrastructure for Supercomputing in Sweden (NAISS), partially funded by the Swedish Research Council through grant agreement no. 2022-06725, for awarding this project access to the LUMI supercomputer, owned by the EuroHPC Joint Undertaking and hosted by CSC (Finland) and the LUMI consortium.

Zhixing Li, Gothenburg, 2025-06-08

Contents

List of Figures
List of Tables
1 Introduction
1.1 Research Topic and Motivations
1.2 Goals and Challenges
1.3 Research Questions
1.4 Limitations and Risks
1.5 Thesis Outline
2 Background
2.1 Domain Generalization
2.2 Prompt Learning
2.3 Feature Adapter
2.4 Ensemble Learning
2.5 Benchmarks
3 Theory
3.1 Vision-Language Models
3.1.1 Text Encoder
3.1.2 Image Encoder
3.1.3 Modality Alignment
3.1.4 Downstream Tasks
3.2 Prompt Learning
4 Related Works
4.1 Methods Based on the Fixed Soft Prompt
4.2 Methods Based on the Dynamically Adjusted Soft Prompt
4.3 Comparison Analysis
5 Methods
5.1 Soft Prompt Design
5.2 Domain Feature Extractor
5.3 Prompt Fusion Mechanism
5.4 Training Strategy
6 Results
6.1 Dataset and Baseline
6.2 Experimental Setup
6.3 Comparison with Baselines
6.3.1 Evaluations on Office-Home
6.3.2 Evaluations on Mini-DomainNet
6.4 Ablation Study
6.5 Analysis of Changing Soft Prompt Length
6.6 Analysis of Domain Feature Extractor
6.6.1 Change the Dimension of the Domain Feature
6.6.2 Change Design
6.7 Analysis of Fusion Mechanism
6.8 Visualization
6.8.1 Visualization of Domain Shift
6.8.2 Visualization of Domain Features
6.8.3 “Ideal” Domain Feature Extractor
6.9 Results Discussion
7 Conclusion
7.1 Summary of Contributions
7.2 Future Work
8 Ethics
Bibliography
A Appendix 1
A.1 Visualization of Domain Features
A.2 Using UMAP to Visualize Features

List of Figures

1.1 Common examples of domain shift in autonomous driving [6]. Variations in architectural styles, weather conditions, and lighting are typical scenarios where domain shift occurs.
2.1 The spectrum of the number of parameters modified by different methods to adapt pre-trained models to downstream tasks.
2.2 Domain invariant features. [17]
2.3 Illustration of text prompt learning [7] in (a) and visual prompt learning [29] in (b). [4]
2.4 Illustration of CLIP-Adapter [31]. f is the image feature, and W is the text features.
2.5 Domain Adaptive Ensemble Learning. [33]
3.1 Vision-language model architecture. [1]
3.2 Transformer architecture. [43]
3.3 Attention mechanism. [43]
3.4 Vision transformer. [25]
3.5 Contrastive learning [1]
4.1 Architecture of ProDA. [52]
4.2 Architecture of CoCoOp. [26]
4.3 Architecture of DDSPL. [56]
5.1 Domain Prompt Fusion (DPF) architecture.
During training, the text encoder and image encoder are frozen, and only the soft prompts (including the domain-agnostic part and domain-specific part) as well as the Domain Feature Extractor (DFE, including the source domain prototypes) are updated.
5.2 Local architectural diagram of the domain feature extractor. The dashed lines indicate processes that occur only during training.
6.1 Example figures from Office-Home and mini-DomainNet
6.2 Distribution of image features of the same class across different domains. Three classes were randomly selected from all classes for visualization.
6.3 Distribution of image features of different classes within the same domain. Ten classes were randomly selected for visualization.
6.4 Distribution of domain features for the “Alarm Clock” category after extraction by the domain feature extractor. Each domain name indicates that it was treated as the target domain, with the DFE trained on the other three source domains.
6.5 Distribution of domain features for different classes. The DFE was trained on source domains Clipart, Real World, and Product. Ten classes were randomly selected for visualization.
6.6 Distribution of domain features for the “Alarm Clock” class extracted by the “ideal” domain feature extractor.
A.1 Distribution of domain features for the “Chair” category after extraction by the domain feature extractor. Each domain name indicates that it was treated as the target domain, with the DFE trained on the other three source domains.
A.2 Distribution of domain features for the “Computer” category after extraction by the domain feature extractor.
Each domain name indicates that it was treated as the target domain, with the DFE trained on the other three source domains.
A.3 Distribution of image features of the same class across different domains. Three classes were randomly selected from all classes for visualization.
A.4 Distribution of image features of different classes within the same domain. Ten classes were randomly selected for visualization.
A.5 Distribution of domain features for the “Alarm Clock” category after extraction by the domain feature extractor. Each domain name indicates that it was treated as the target domain, with the DFE trained on the other three source domains.
A.6 Distribution of domain features for different classes. The DFE was trained on source domains Clipart, Real World, and Product. Ten classes were randomly selected for visualization.
A.7 Distribution of domain features for the “Alarm Clock” class extracted by the “ideal” domain feature extractor.

List of Tables

6.1 Comparison of accuracy across different methods on the Office-Home dataset. The name of each column represents the target domain used during testing, while the other three domains serve as the source domains for training in that setting. Results are sorted in ascending order of average accuracy.
6.2 Comparison of accuracy across different methods on the mini-DomainNet dataset. The name of each column represents the target domain used during testing, while the other three domains serve as the source domains for training in that setting. Results are sorted in ascending order of average accuracy.
6.3 Ablation study results on Office-Home dataset.
“DAP only” refers to using only the domain-agnostic prompt, with all other modules unchanged. “DSP only” refers to using only the domain-specific prompt, with all other modules unchanged. “Remove DFE” means no longer using the domain feature extractor to guide fusion; instead, the raw image feature is used directly. “Greedy fusion” refers to replacing weighted fusion with directly using the text feature from the domain with the highest similarity. “Average fusion” refers to omitting similarity calculations and directly using the mean of the text features from all source domains.
6.4 Comparison results on Office-Home dataset. The prompt length is given in the order of “domain-agnostic prompt + domain-specific prompt”.
6.5 Training and testing accuracies for different domain feature dimensions. Results were obtained on the Office-Home dataset with source domains Real World, Product, and Art, and target domain Clipart. “Source domain accuracy” refers to the accuracy on the domain classification task over the source domains, and “Target domain accuracy” refers to the accuracy on the image classification task in the target domain.
6.6 Results of adjusting the design of domain feature extractor on Office-Home dataset.
6.7 Results of adjusting fusion temperature on Office-Home dataset.
6.8 Comparison between domain features extracted using the “ideal” domain feature extractor and those extracted by a DFE trained normally on source-domain samples.
1 Introduction

In computer vision, the methodology for training models has generally progressed through three major stages. In the early stage, traditional training methods required collecting large-scale training data and labeling it for specific tasks [1], which was time-consuming and labor-intensive. Additionally, these methods typically suffered from slow convergence, often requiring a substantial number of training epochs to achieve satisfactory performance.

In the second stage, with the emergence of the Pre-training, Fine-tuning, and Prediction approach, it became unnecessary to train models from scratch. Instead, a pretrained model could be fine-tuned on the downstream task dataset [2]. Compared to training from scratch, this method offers faster convergence, requires less training data, and can even achieve superior performance. However, this approach still relies on a certain amount of labeled data from the downstream task. Moreover, the fine-tuned model often suffers from catastrophic forgetting [3], a phenomenon in which the model loses the knowledge acquired during pre-training, resulting in significantly reduced generalization ability and poor transferability to other downstream tasks.

Recently, with the rapid advancement of language models in the field of natural language processing (NLP), traditional vision models have begun to integrate with language models, giving rise to a new class of models known as vision-language models [4]. These models are pre-trained on large-scale image-text pairs and can be directly applied to downstream tasks without the need for additional fine-tuning.
Vision-language models represent a groundbreaking intersection of computer vision and natural language processing. They aim to combine visual and linguistic information, enabling machines to understand and reason about the world in a manner that closely mimics human cognitive processes.

1.1 Research Topic and Motivations

The potential for adaptability and generalizability in vision-language models is particularly exciting. Current vision models, while powerful, often struggle when faced with scenarios that deviate significantly from their training data [5]. This limitation hinders their practical application and scalability. Particularly in practical application scenarios such as autonomous driving and intelligent robotics, models must handle complex, unpredictable real-world environments where it is nearly impossible to cover all possibilities in the training data. This necessitates models with excellent generalization capabilities that can handle unseen scenarios effectively based on the training data.

Figure 1.1: Common examples of domain shift in autonomous driving [6]. Variations in architectural styles, weather conditions, and lighting are typical scenarios where domain shift occurs.

Vision-language models, represented by CLIP [1], with their strong generalization ability, present a highly promising solution to these challenges. Therefore, this study focuses on exploring ways to further improve the generalization ability of vision-language models, overcoming challenges such as domain shift (as shown in Figure 1.1), and enabling the models to perform better in highly complex and dynamic real-world environments.

1.2 Goals and Challenges

This research has two objectives. First, we seek to advance the theoretical understanding of how semantic information and prompt learning can be effectively integrated into vision-language models to improve generalization.
Second, we aim to develop practical techniques and architectures that implement these insights, resulting in models that demonstrate superior robustness and performance across diverse tasks and domains.

We expect our research to yield the following outcomes:

1. Improved cross-domain performance, demonstrating enhanced generalization.
2. Effective integration of prompt learning for improved Out-of-Distribution (OOD) generalization.
3. Better adaptability to new, unseen tasks and domains.
4. More robust representations from both modalities that capture deeper semantic relationships.

However, as described earlier, achieving these goals is highly challenging. For example, we may need to address the problem of domain shift, which typically refers to the situation where the training and testing data come from different domains, resulting in distributional discrepancies. Traditional vision models perform poorly in such scenarios, showing little to no generalization across different domains. Although CLIP has significantly mitigated this issue by raising the model’s performance on domain-shifted data to an acceptable level [1], it still falls short of the ideal. One contributing factor is that CLIP relies on manually designed hard prompts, which are not only difficult to optimize but also lack flexibility. Prompt Learning (see Section 2.2 for a detailed definition), a technique that enables the automatic optimization of soft prompts, has been shown to effectively improve the generalization ability of vision-language models [7]. We want to further enhance vision-language models, especially using prompt learning, enabling them to adapt to a wider range of domains while minimizing the gap between their performance on domain-shifted data and the ideal scenario.

1.3 Research Questions

We aim to further enhance the generalization ability of vision-language models through prompt learning techniques.
This is a lightweight adaptation strategy that allows the pretrained model to achieve better performance on downstream tasks by optimizing the prompt, without the need to modify the model’s parameters.

Although significant work has been done in this area (see Section 2.1), existing methods still have limitations in addressing the Multi-Source Domain Generalization (MSDG) [8] problem. Specifically, we consider a generalization problem with $L$ source domains $S_l = \{(x_i^l, y_i^l)\}_{i=1}^{n_l}$, each associated with a joint distribution $P_{XY}^{l}$. Note that $P_{XY}^{l} \neq P_{XY}^{l'}$, $\forall l, l' \in \{1, \dots, L\}$ and $l \neq l'$. The goal of the MSDG problem is to learn a predictive function $f : \mathcal{X} \to \mathcal{Y}$ using source domain data such that it minimizes the prediction error on an unseen target domain $S_{target}$ with $P_{XY}^{target} \neq P_{XY}^{l}$, $\forall l \in \{1, \dots, L\}$:

$$\min_{f} \; \mathbb{E}\left[\mathcal{L}\left(f(x^{target}), y^{target}\right)\right], \qquad (1.1)$$

where $\mathcal{L}(\cdot, \cdot)$ is the loss function.

We assume that we have access to sufficient labeled data from the source domains. However, we do not make any assumptions about the target domain, meaning it could be a completely unseen domain or an arbitrary combination of multiple different domains. Naturally, we also do not have access to any training data from the target domain. To simplify the problem, we also assume that the target domain and source domains share the same label set.

The research questions of this study are as follows:

1. How can we use soft prompt learning to improve the generalization capability of CLIP, enabling it to perform better in the context of the MSDG problem?
2. Can our domain-feature-guided prompt fusion mechanism improve modality alignment between image and text features, compared to existing domain generalization methods?
3. What factors contribute to the strengths and limitations of our method in achieving domain generalization, as revealed through comparative and ablation studies?
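The MSDG setting of Section 1.3 can be illustrated with a leave-one-domain-out loop, in which each domain is held out once as the unseen target while the remaining domains are pooled as sources. The following is only a minimal sketch under stated assumptions: the toy scalar inputs, the majority-label learner `fit_majority`, and the use of the 0-1 loss as the loss function in Eq. (1.1) are hypothetical placeholders chosen for illustration, not the method developed in this thesis.

```python
from collections import Counter
from typing import Callable, Dict, Iterator, List, Tuple

Sample = Tuple[float, int]  # (toy input x, label y); real inputs would be images


def leave_one_domain_out(
    domains: Dict[str, List[Sample]],
) -> Iterator[Tuple[str, List[Sample], List[Sample]]]:
    """Hold out each domain once as the unseen target; pool the rest as sources."""
    for target_name, target_data in domains.items():
        source_data = [
            sample
            for name, data in domains.items()
            if name != target_name
            for sample in data
        ]
        yield target_name, source_data, target_data


def fit_majority(source_data: List[Sample]) -> Callable[[float], int]:
    """Hypothetical placeholder learner: always predicts the most frequent source label."""
    majority = Counter(y for _, y in source_data).most_common(1)[0][0]
    return lambda x: majority


def target_error(f: Callable[[float], int], target_data: List[Sample]) -> float:
    """Empirical 0-1 loss on the target domain (assumed loss; estimates Eq. (1.1))."""
    return sum(f(x) != y for x, y in target_data) / len(target_data)


# Toy data: three domains sharing the same label set {0, 1}.
toy_domains = {
    "art": [(0.1, 0), (0.2, 0), (0.3, 0), (0.9, 1)],
    "clipart": [(0.3, 0), (0.8, 1), (0.7, 1)],
    "product": [(0.4, 0), (0.5, 0)],
}

errors = {}
for name, sources, target in leave_one_domain_out(toy_domains):
    f = fit_majority(sources)  # the learner only ever sees source-domain data
    errors[name] = target_error(f, target)
```

The point of the sketch is the data flow: the predictor is fitted without any access to target-domain samples, and the target domain is used solely to estimate the error.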
1.4 Limitations and Risks

Due to time constraints, our main focus is on text prompt learning rather than multimodal prompt learning. The former is simpler but overlooks adjustments to the visual branch. By designing a more effective visual prompt or establishing a better interaction mechanism between the text and visual branches, prompts could potentially generalize better.

Moreover, we assume that the source domain and target domain share the same label set. However, in real-world scenarios, their label sets are not always fully overlapping, and the target domain often introduces unknown labels. This makes the problem more complex. Due to time constraints, we do not consider this case for now, leaving it for future improvements.

Finally, given that our focus is on the domain generalization problem, we do not conduct additional evaluations of the model on standard image classification tasks. This is primarily because domain shift is absent in such settings, rendering most of the components in our proposed framework ineffective.

In this project, we used publicly available datasets and code developed by others. The dataset authors have granted permission for unrestricted use of the data for non-commercial academic research purposes. The code is released under the MIT License, which allows free usage within the license’s terms and conditions. The datasets do not contain any sensitive information, and as such, we believe their use poses minimal privacy risks. For a more detailed discussion of ethical topics, please refer to Chapter 8.

1.5 Thesis Outline

This thesis is organized into 8 chapters.

In Chapter 1, we introduce the project background, research motivation, objectives, questions, and limitations.

In Chapter 2, we present the main findings from our literature review, covering commonly used approaches for addressing the domain generalization problem, as
well as techniques related to prompt learning, feature adapters, ensemble learning, and established benchmarks.

In Chapter 3, we provide a detailed explanation of the theoretical foundations of our work, specifically the operational principles of vision-language models and the design philosophy of prompt learning.

In Chapter 4, we introduce recent methods that leverage prompt learning techniques to improve the generalization ability of vision-language models. We also provide a brief comparison of the similarities and differences among these approaches.

In Chapter 5, we elaborate on our proposed method, detailing the overall framework, the design rationale for each module, and the associated theories.

In Chapter 6, we present the evaluation results of our method, including the datasets used, baseline comparisons, implementation details, and testing metrics, along with a comprehensive analysis of the outcomes that highlights both strengths and weaknesses.

In Chapter 7, we summarize the key contributions and findings of our work.

In Chapter 8, we discuss ethical topics related to our project.

2 Background

There is still a desire to further improve the generalization ability of vision-language models, especially in cases requiring a certain level of prior knowledge. The spectrum of typical methods and the number of parameters that need to be adjusted are shown in Figure 2.1. The simplest and most straightforward approach is to fine-tune the entire pre-trained model or a few layers of it. Theoretically, this method can achieve better performance on specific tasks. However, since pre-trained models are usually quite large, fine-tuning the entire model is time-consuming and demands significant computational resources. Moreover, during the fine-tuning process, the model often loses some of the knowledge learned during pre-training, leading to a decline in its generalization ability. This issue is known as catastrophic forgetting.
Figure 2.1: The spectrum of the number of parameters modified by different methods to adapt pre-trained models to downstream tasks.

To address these issues, prompt learning [9] offers a promising solution. This approach avoids fine-tuning the pre-trained model. Instead, it learns an optimal prompt, enabling the pre-trained model to better adapt to the downstream task. Alternatively, we can attach a lightweight network to the pre-trained model to modify the features extracted by it. This approach is known as a feature adapter [10].

2.1 Domain Generalization

Traditional machine learning and deep learning theories generally assume that source and target data are independently and identically distributed. However, this assumption is often difficult to satisfy in real-world applications. To ensure that algorithms or models achieve stable and accurate results in practice, these methods must be capable of handling out-of-distribution (OOD) data and even counteracting domain shift. Domain shift refers to the discrepancy between the distribution of training data and that of test data [11]. The objective of the domain generalization (DG) problem [12] is to address domain shift, and the specific definition is provided in Section 1.3.

Another problem related to domain shift is domain adaptation (DA) [13], which assumes access to (unlabeled) training data from the target domain. Compared to DG, DA makes a stronger assumption about the availability of target domain information. For example, in the context of autonomous driving, it is challenging to include every combination of road traffic environments from all cities, weather conditions, time periods, and potential traffic hazard types in a dataset. This implies that some scenarios are inevitably not represented in the collected data. Consequently, DG places even higher demands on algorithms, requiring strong generalization capabilities to handle unseen scenarios or conditions properly, or at least reasonably.
Numerous approaches have been proposed to address the DG problem. Domain alignment is perhaps the most extensively studied approach. Its core idea is to minimize the discrepancies between the data distributions of the source domains in order to learn domain-invariant representations [14], as illustrated in Figure 2.2. Typically, these methods employ a metric to quantify the differences between distributions, such as moments [15], KL divergence [16], or Maximum Mean Discrepancy (MMD) [17], among others. Alternatively, contrastive learning [18] or adversarial learning [19] techniques are used to promote the extraction of domain-invariant features.

Figure 2.2: Domain invariant features. [17]

Meta-learning is another widely used strategy. A typical example is Model-Agnostic Meta-Learning (MAML) [20], which further divides the training data into meta-train and meta-test sets, training on the meta-train set to enhance performance on the meta-test set. However, these methods generally aim to learn an improved initialization, so that only a few additional training rounds on the target task are needed to achieve satisfactory performance.

In addition, data augmentation is a widely used technique for enhancing the generalization capabilities of machine learning and deep learning models. Its core idea is straightforward: by applying transformations such as scaling, cropping, rotating, and altering color [21], or by mixing image features [22], it is possible to simulate the effects of domain shift to a certain extent, thereby enabling the model to learn how to extract more robust features. Beyond manually designed augmentation strategies, some researchers have explored using neural networks to learn effective data augmentation patterns that specifically improve the model’s generalization performance [23].

Beyond the three common approaches mentioned above, numerous other methods can be employed to address the DG problem, such as ensemble learning, learning disentangled representations, regularization techniques, and reinforcement learning [14]. However, most current approaches are confined to traditional vision models, such as CNNs [24] or ViTs [25], and there is relatively little research on domain generalization in vision-language models. Our research is dedicated to bridging this gap.

2.2 Prompt Learning

In summary, prompt learning primarily focuses on three directions: text prompt learning, visual prompt learning, and multimodal prompt learning.

Unlike discrete prompt engineering in natural language, text prompt learning optimizes prompts in the continuous word embedding space, so the learned prompts may not correspond to any actual natural language text. These soft prompts may encode information that is more effective for the model than natural language text, allowing them to better guide the model in performing downstream tasks.

For instance, CoOp [7] embeds a category name into a string like “$[V]_1, [V]_2, \dots, [V]_M, [CLASS]$”, where each $[V]$ represents a word embedding. By minimizing the classification loss on the downstream task, an optimal prompt representation (shared across categories or distinct for each category) can be learned. CoCoOp [26] extends this idea further by finding an optimal prompt for each image to better describe its content. DAPrompt [27] proposes encoding domain information into the text prompt to effectively enhance the model’s domain generalization ability. It achieves this by learning a domain-agnostic prompt to capture domain-independent information and training domain-specific prompts for each domain.

There are two main implementations of visual prompts. The first approach, similar to text prompts, adds a set of learnable parameters as prompts in the input layer of a Vision Transformer, such as VPT [28].
The second approach introduces perturbations to the input image; these perturbations serve as visual cues that guide the model in extracting more informative features [29].

Multimodal prompt learning combines text and visual prompt learning, as in MaPLe [30]. An increasing number of studies attempt to leverage multimodal information for complementary advantages and mutual enhancement in order to improve the generalization ability of vision-language models.

Figure 2.3: Illustration of text prompt learning [7] in (a) and visual prompt learning [29] in (b). [4]

Although prompt learning has been shown to enhance the generalization ability of vision-language models and improve their performance on specific downstream tasks, it also has certain limitations. For instance, while it avoids fine-tuning the pre-trained model, the training cost of text prompt learning may remain high, especially when generating descriptions tailored to specific images [26]. Moreover, the performance improvement on downstream tasks often comes at the expense of reduced generalization ability on other tasks [7]. Whether it is worth sacrificing the knowledge already learned by the pre-trained model for the sake of performance on a specific task requires careful consideration and case-by-case analysis. Finally, these learned prompt representations are often difficult to interpret. We do not have a clear understanding of the circumstances under which these representations are applicable or when they may fail.

2.3 Feature Adapter

Unlike prompt learning, a feature adapter does not improve generalization ability by modifying the input to the pre-trained model. Instead, it modifies the features extracted by the pre-trained model, enabling the model to capture more effective and task-specific features.
For example, CLIP-Adapter [31] adds a lightweight MLP network after the original image encoder and text encoder to further extract features, which are then combined with the original features through residual connections.

Figure 2.4: Illustration of CLIP-Adapter [31]. f denotes the image feature, and W denotes the text features.

This method is simpler to implement and train than text prompt learning, while the residual fusion allows the model to balance the knowledge already learned by the pre-trained model with the new knowledge acquired for specific downstream tasks. One drawback is that feature adapters typically require more trainable parameters than prompt learning. Additionally, they involve more hyperparameters, which often makes the tuning process more complex.

2.4 Ensemble Learning

Ensemble learning typically refers to the simultaneous training of multiple copies of the same model, with each copy trained on a different subset of the training data to ensure diversity [32]. During inference, the ensemble aggregates the outputs of the individual sub-models to obtain a more accurate prediction. This design results in predictions that are more robust than those of a single sub-model, demonstrating a stronger ability to resist noise and perturbations, and has proven effective in addressing the DG problem.

A typical strategy for enhancing model generalization is to train a separate backbone or classifier head for each source domain. For instance, Domain Adaptive Ensemble Learning (DAEL) [33] comprises a CNN feature extractor shared across all source domains alongside an individual classifier head for each domain. By coordinating the outputs of these domain-specific classifier heads, the model can effectively handle input images from various domains.

Figure 2.5: Domain Adaptive Ensemble Learning.
[33]

Another strategy avoids explicitly training multiple sub-models and instead aggregates weights from different training stages of a single model. Since the model tends to focus on different features at various stages of training, integrating these weights not only improves generalization but also significantly reduces the time and space overhead compared to training several models. In [34], the authors introduced PromptSRC, a method that combines prompt learning with ensemble learning by performing a weighted fusion of soft prompts from different training stages, where the weights are sampled from a Gaussian distribution. They argue that self-ensemble of soft prompts enables the integration of useful knowledge acquired at various stages, thereby effectively enhancing model generalization.

Although ensemble learning methods have demonstrated considerable potential in enhancing model generalization, they also come with certain drawbacks. Training multiple models undoubtedly reduces training and inference efficiency, introduces additional parameters, and increases storage requirements. Moreover, designing appropriate ensemble weights is crucial for effective integration; if the weights are not properly calibrated, they may not only fail to improve generalization but could also degrade model performance.

2.5 Benchmarks

Currently, there are numerous datasets available for evaluating the performance of vision-language models. We focus on using vision-language models for image classification tasks. General-purpose datasets such as ImageNet [35] and CIFAR-10 [36] can be used to evaluate a model’s performance on general tasks. Additionally, task-specific datasets like Stanford Cars [37] and EuroSAT [38], which may require certain prior knowledge, can be used to assess the model’s performance on specific tasks.
The most commonly used metric is accuracy and its variants, such as the arithmetic mean or harmonic mean accuracy across different categories or domains.

To evaluate the generalization ability of a model, the most common approach is zero-shot learning [39], where the model is applied directly to a new task without any fine-tuning. Additionally, linear probing [1] can be used to assess the feature extraction capability of the model. In this method, the backbone network is frozen, and a linear classifier is trained on a new dataset. The classification performance is then used to evaluate the effectiveness of the extracted features.

Currently, the generalization ability of models is often evaluated by directly applying the trained model to new datasets. However, there are also datasets specifically designed to test a model’s ability to handle domain shift, such as Office-Home [40] and DomainNet [41]. Accuracy is typically used as the evaluation metric in these cases.

3 Theory

In this chapter, we introduce two key theoretical foundations of our work. First, we provide a detailed explanation of the operational principles of vision-language models; subsequently, we present the design philosophy behind prompt learning techniques.

3.1 Vision-Language Models

In simple terms, a vision-language model is a type of pre-trained vision model designed to better address zero-shot prediction problems in visual recognition tasks by learning associations between images and text. Typically, this neural network consists of two components: an image encoder responsible for extracting image features and a text encoder for extracting text features, as shown in Figure 3.1. Given an image-text pair as input, the model solves visual recognition tasks by performing some form of matching between the image features and text features.
As a pre-trained model, the vision-language model can be applied to various downstream tasks, including image classification, semantic segmentation, object detection, and image/text generation, among others.

For the image encoder, models based on convolutional neural networks, such as ResNet [24] and ResNet-D [42], are commonly used. Alternatively, transformer-based models like the Vision Transformer (ViT) [25] are also popular choices. For the text encoder, the Transformer [43] and its variants remain the primary models in use.

Figure 3.1: Vision-language model architecture. [1]

3.1.1 Text Encoder

In this section, we briefly introduce the Transformer architecture used as the text encoder. As shown in Figure 3.2, the architecture consists of two main components: an encoder and a decoder, each composed of multiple stacked Transformer blocks. The encoder transforms the input text sequence into a sequence of embedding vectors, while the decoder reconstructs the text sequence based on these embeddings. Each Transformer block contains several key components, including a multi-head self-attention layer, a fully connected feed-forward network, residual connections, and layer normalization.

Figure 3.2: Transformer architecture. [43]

The attention mechanism is the core component of the Transformer architecture. Its fundamental principle is illustrated in Figure 3.3. In simple terms, it involves computing a similarity score between a query vector $q$ and a key vector $k$, and then using this score to perform a weighted aggregation of the corresponding value vectors $v$. The matrix formulation is given as follows:

\[ \mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V. \tag{3.1} \]

To enable the model to capture different aspects of the same sequence, the self-attention mechanism is replicated multiple times, resulting in the multi-head attention mechanism. In simple terms, the $Q$, $K$, $V$ matrices are first linearly projected into multiple subspaces.
Within each subspace, attention is computed independently, and the outputs are then concatenated. This approach enhances the model’s representational capacity by allowing it to attend to diverse semantic information simultaneously. The formula is as follows:

\[ \mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O}, \quad \text{where } \mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V}). \tag{3.2} \]

Here, $h$ is the number of heads, and the $W$ are projection matrices with $W^{O} \in \mathbb{R}^{d_{model} \times d_{model}}$, $W_i^{Q} \in \mathbb{R}^{d_{model} \times d_q}$, $W_i^{K} \in \mathbb{R}^{d_{model} \times d_k}$, and $W_i^{V} \in \mathbb{R}^{d_{model} \times d_v}$, where $d_{model}$ is the embedding dimension and $d_q = d_k = d_v = d_{model}/h$.

Figure 3.3: Attention mechanism. [43]

In addition to the self-attention layer, another important component within each Transformer block is the feed-forward network, which further transforms the representation of each token individually. It consists of two fully connected layers with a ReLU activation function in between:

\[ \mathrm{FFN}(X) = W_2\,\sigma(W_1 X), \tag{3.3} \]

where $W_1$ and $W_2$ are two parameter matrices, and $\sigma(\cdot)$ represents the activation function. Furthermore, both the self-attention layer and the feed-forward network are followed by residual connections and layer normalization [44] to stabilize training and facilitate gradient flow.

It is worth noting that the Transformer architecture does not contain any recurrent structures; all input tokens are processed in parallel. To incorporate positional information from the sequence, positional encodings are added to the embedding of each token:

\[ PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \tag{3.4} \]

\[ PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right), \tag{3.5} \]

where $pos$ is the position of the token in the sequence and $i$ indexes the embedding dimension. These encodings enable the Transformer to learn both the absolute positions of individual tokens and the relative positional relationships between them.

In CLIP, the authors employed a 12-layer Transformer model with a hidden size of 512 and 8 attention heads.
The input text sequence is enclosed with special tokens: [SOS] at the beginning and [EOS] at the end. The textual representation is extracted from the [EOS] token at the final layer of the Transformer, followed by layer normalization and a linear projection into the multimodal embedding space [1].

3.1.2 Image Encoder

In this section, we introduce the Vision Transformer (ViT) model, which is commonly used as the image encoder. Although CNN-based visual models are also widely used, their performance is generally inferior to that of ViT [1]. Due to time constraints, this study focuses exclusively on vision-language models based on ViT.

Figure 3.4: Vision transformer. [25]

The architecture of ViT is illustrated in Figure 3.4. Overall, it closely follows the structure of the standard Transformer model [43]. However, ViT utilizes only the encoder part of the Transformer to extract image features and employs an MLP network to perform the classification task.

Since the Transformer can only process 1D sequence inputs, the authors first divide the original 2D image $x \in \mathbb{R}^{H \times W \times C}$ into a series of image patches $x_p \in \mathbb{R}^{N \times (P^2 C)}$, where $(H, W)$ is the resolution of the original image, $C$ is the number of channels, and $(P, P)$ is the resolution of each image patch. The number of resulting patches is given by $N = HW/P^2$. After patching, each 2D image patch is further transformed into a 1D vector (embedding) via a linear projection, making it suitable for processing by the Transformer.

To obtain a representation for each image, or equivalently the image feature, the authors adopt a method similar to that used in BERT [45]. Specifically, a special [CLS] token is prepended to the sequence of image patch embeddings to learn a holistic representation of the image. The final image feature is then extracted from the output corresponding to the [CLS] token at the last layer of the Transformer.
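To make the patch arithmetic concrete, the following is a minimal sketch in plain Python. The helper name `patchify` and the toy 4 × 4 image are illustrative assumptions and are not taken from the ViT or CLIP code bases; practical implementations usually perform the same operation with tensor reshapes.

```python
def patchify(image, patch_size):
    """Split an H x W x C image (nested lists) into N flattened patches.

    Each patch is a flat list of length P * P * C, so the result has shape
    N x (P^2 * C) with N = H * W / P^2, matching the notation in the text.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = []
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    patch.extend(image[row][col])  # append the C channel values
            patches.append(patch)
    return patches


# Hypothetical toy image: 4 x 4 pixels with 3 channels per pixel.
toy_image = [[[r, c, r + c] for c in range(4)] for r in range(4)]
toy_patches = patchify(toy_image, patch_size=2)
print(len(toy_patches), len(toy_patches[0]))
```

With H = W = 4, C = 3, and P = 2, the sketch yields N = 4 patches of length 12. By the same formula, a 224 × 224 image with P = 16 gives N = 196 patches.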
As in the original Transformer, 1D positional encodings are added to each embedding vector to incorporate the sequential information among image patches.

The authors point out that a significant distinction between ViT and CNN-based models lies in the amount of inductive bias. In CNNs, inductive priors such as locality, two-dimensional spatial structure, and translation equivariance are inherently embedded in the model through the use of convolutional kernels. In contrast, ViT exhibits much less such bias. This is because the attention mechanism operates globally, with limited emphasis on local spatial relationships, and the positional encodings do not impose any explicit assumptions about spatial structure. As a result, all knowledge about spatial relationships must be learned from scratch during training. This reduced inductive bias is considered a key characteristic of ViT and may partly explain its superior performance compared to traditional CNN models.

In the CLIP model, the authors adopted an implementation that is nearly identical to the one provided in the original ViT paper [1]. As with the text encoder, the image feature produced by ViT is mapped by a linear projection into CLIP’s multimodal embedding space. Aligning text features and image features within the same feature space is essential to ensure the meaningfulness of subsequent similarity comparisons.

3.1.3 Modality Alignment

Figure 3.5: Contrastive learning [1]

The design of the objective function is also a key aspect. Contrastive objectives [46] are among the most commonly used objectives for vision-language models, as exemplified by CLIP [1]. The basic idea is to maximize the cosine similarity between the features of true image-text pairs while minimizing the cosine similarity between all other combinations of features, as shown in Figure 3.5.
More precisely, in CLIP, the contrastive loss between image (I) and text (T) consists of two components: the image-to-text contrastive loss $\mathcal{L}_{I \to T}$ and the text-to-image contrastive loss $\mathcal{L}_{T \to I}$. The formula is as follows [47]:

\[ \mathcal{L}_{T \to I} = -\sum_{i=1}^{B} \log \frac{\exp\left(z_i^{T} \cdot z_i^{I} / \tau\right)}{\sum_{j=1}^{B} \exp\left(z_i^{T} \cdot z_j^{I} / \tau\right)}, \tag{3.6} \]

\[ \mathcal{L}_{I \to T} = -\sum_{i=1}^{B} \log \frac{\exp\left(z_i^{I} \cdot z_i^{T} / \tau\right)}{\sum_{j=1}^{B} \exp\left(z_i^{I} \cdot z_j^{T} / \tau\right)}. \tag{3.7} \]

Here, $B$ represents the batch size, $\tau$ is the temperature, and $z_i^{I}$ and $z_i^{T}$ denote the features of the $i$-th image and the $i$-th text in the batch, which form a matched pair. Finally, the total loss is $\mathcal{L} = \mathcal{L}_{T \to I} + \mathcal{L}_{I \to T}$.

It is important to note that CLIP employs a cosine similarity-based InfoNCE loss [48], rather than other contrastive loss functions such as Euclidean distance-based pair loss [49] or triplet loss [50]. Although the original paper does not explicitly discuss the rationale behind this choice, we speculate that it may be due to the following reasons:

1. Direction is more important than magnitude. When computing Euclidean distance, differences in vector magnitude often dominate the distance calculation. In contrast, cosine similarity normalizes the feature vectors, projecting them onto a unit hypersphere. This normalization encourages contrastive learning to focus on angular differences, enabling the model to learn more effective representations in which similar samples lie close together and dissimilar samples are evenly distributed across the hypersphere [51].

2. The curse of dimensionality. In high-dimensional spaces, vector distributions tend to be extremely sparse, and distances between points can become uniformly large. Under such conditions, Euclidean distance becomes less discriminative, making it difficult to distinguish between different feature types.

3. Numerical stability. Cosine similarity has a bounded range of $[-1, 1]$, whereas Euclidean distance ranges from 0 to $\infty$.
This bounded nature of cosine similarity helps improve numerical stability during training, reducing the risk of exploding gradients or instability caused by large distance values.

Additionally, for generative tasks, such as generating images or textual descriptions, a generative objective can be employed. For tasks involving matching between images and text, an alignment objective can be used.

Vision-language models are typically trained on extremely large datasets. Take CLIP as an example: the authors constructed a new dataset containing 400 million (image, text) pairs, named WebImageText (WIT) [1]. Unfortunately, the authors did not release their dataset, so we have no way of knowing more details about the training data.

3.1.4 Downstream Tasks

As previously mentioned, vision-language models can be applied to a wide range of downstream tasks. Here, we use CLIP for a zero-shot image classification task as an example to illustrate how a pretrained vision-language model can be applied to downstream tasks.

To enable CLIP to perform a zero-shot image classification task, in addition to providing the image to be classified, we usually need to manually create a prompt that includes the possible class names, such as “a photo of a [CLASS].”, where [CLASS] represents the potential category labels.

After extracting the image and text features using the image encoder and text encoder respectively, we calculate the cosine similarity between the image feature and each text feature. The probability that the input image $x_i$ belongs to class $k$ can then be computed using the following formula:

\[ P(y_i = k \mid x_i, w_k) = \frac{\exp\left(\cos(g(w_k), f(x_i)) / \tau\right)}{\sum_{j=1}^{K} \exp\left(\cos(g(w_j), f(x_i)) / \tau\right)}. \tag{3.8} \]

Here, $g(\cdot)$ represents the text encoder, $f(\cdot)$ represents the image encoder, $w$ represents the designed prompt, $K$ is the number of classes, $\tau$ is the temperature parameter, and $\cos(\cdot, \cdot)$ denotes the cosine similarity:

\[ \cos(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}. \tag{3.9} \]

The reason for using cosine similarity as the metric for computing classification probabilities is closely related to CLIP’s pre-training objective. As introduced in Section 3.1.3, CLIP is trained using a contrastive loss, which in practice maximizes the cosine similarity between true image-text pairs. Although the original loss formulation uses the dot product, the features are normalized to unit length prior to similarity computation, meaning that the dot product is effectively equivalent to cosine similarity. As a result, in the embedding space, an image feature will have the highest cosine similarity with its corresponding text feature. If alternative distance metrics, such as Euclidean distance, were used, this correspondence would no longer be guaranteed.

Compared to traditional classification approaches that rely on an MLP-based classifier, CLIP exhibits stronger classification capabilities. First, cosine similarity-based classification does not rely on a fixed decision boundary, enabling CLIP to handle previously unseen categories. Second, by simply modifying the prompt fed into the text encoder, one can make the textual description better match the image content, allowing for improved classification performance even without fine-tuning the model. This is also the key principle underlying prompt learning.

3.2 Prompt Learning

Since manually designed prompts are difficult to optimize, finding a well-performing prompt can be time-consuming and labor-intensive. To address this, researchers have proposed replacing natural language text with learnable word embeddings. Because these embeddings lie in the continuous word embedding space, gradient-based optimization can search for a well-performing prompt automatically. For example, CoOp designs the prompt as a set of learnable embeddings:

\[ p_k = [V]_1, [V]_2, \ldots, [V]_M, [\mathrm{CLASS}]_k, \tag{3.10} \]

where $[V]$ represents the word embedding vector in the word embedding space.
It can encode class-specific information, domain-specific information, or any other type of information that may guide the model to complete the downstream task. $M$ is a hyperparameter that controls the length of the soft prompt.

During training, all other parameters of the pre-trained model are kept frozen, and only the soft prompt parameters are optimized using the cross-entropy loss function:

\[ \mathcal{L}_{ce} = -\sum_{i=1}^{N} y_i \log P(\hat{y} = i \mid x), \tag{3.11} \]

where the sum runs over the $N$ candidate classes and $y_i$ is the $i$-th entry of the one-hot ground-truth label. After obtaining the optimized soft prompt, the probability of an image belonging to each category is calculated using the same formula as CLIP (Equation 3.8).

4 Related Works

In this chapter, we introduce a series of state-of-the-art methods that leverage prompt learning techniques to enhance the generalization ability of vision-language models. These methods will serve as baselines for comparison with our proposed approach in Chapter 6. We also provide a brief analysis of the similarities and differences among them.

Broadly speaking, these methods can be categorized into two groups. The first group uses a fixed soft prompt, meaning that the soft prompt remains unchanged during inference after being trained. Examples include CoOp [7], ProDA [52], and BPL [53]. The second group allows the soft prompt to be dynamically adjusted or generated during inference, as seen in methods such as CoCoOp [26], DPL [54], StyLIP [55], DDSPL [56], and SPG [57].

4.1 Methods Based on the Fixed Soft Prompt

CoOp is the first method to introduce prompt learning into the field of vision-language models and remains one of the most well-known and influential approaches in this area. Its core idea is relatively simple: it replaces manually designed hard prompts with a set of learnable word embedding vectors. We have already provided a detailed introduction to this method in Section 2.2.

Figure 4.1: Architecture of ProDA. [52]

ProDA aims to enhance the generalization ability of vision-language models by learning a distribution over prompts.
The method introduces a learnable set of soft prompts, where each class is associated with multiple prompts. After being encoded by the text encoder, these prompts produce a distribution of text features for each class in the embedding space. Rather than modeling the prompt set directly, ProDA estimates the distribution of each class by the text features generated from its corresponding prompts. Using multivariate Gaussian modeling, ProDA defines an optimizable upper-bound loss function for training. To promote diversity among prompts during training, the method introduces positional variations in the prompt structure, as well as a semantic orthogonality constraint to enhance the expressiveness of the prompt set. At inference time, the mean of the text feature distribution for each class is used as its representative text feature.

BPL formulates prompt learning as a variational inference problem by introducing a Bayesian framework into the prompt space. Each prompt is composed of a set of fixed learnable vectors added to a global residual vector, which is treated as a latent variable modeled by a learnable Gaussian distribution $r \sim \mathcal{N}(\mu, \Sigma)$. During training, a residual vector is sampled from this distribution and added to all prompt tokens to form a complete prompt. The model is trained by maximizing the variational lower bound, which includes a log-likelihood term for label prediction and a KL divergence term between the posterior and prior of the prompt residual. At inference time, multiple prompts are generated by sampling residuals from the learned distribution, and the corresponding text features are used to produce multiple predictions. The final classification result is obtained by averaging these predictions.

4.2 Methods Based on the Dynamically Adjusted Soft Prompt

Figure 4.2: Architecture of CoCoOp. [26]

CoCoOp builds upon CoOp by introducing image-conditioned prompt modeling to improve generalization to unseen classes.
Specifically, it incorporates a residual adjustment mechanism conditioned on the input image. This mechanism is implemented via a lightweight neural network called Meta-Net, which takes the output of the image encoder as input and produces a conditioning vector $\pi = h_\theta(x)$. This vector is then used to adjust each prompt token $v_m$ as $v_m(x) = v_m + \pi$.

DPL does not directly optimize the soft prompts themselves. Instead, it trains a prompt generator, which is a lightweight MLP network capable of dynamically generating soft prompts based on the image features of the input. To improve CLIP’s generalization ability when dealing with images from different domains, DPL averages the soft prompts generated for all samples within each source domain to obtain a domain-specific prompt.

StyLIP enhances the generalization ability of CLIP in cross-domain image classification tasks by introducing a multi-scale style-conditioned prompt learning mechanism. The core idea is to leverage the style information of an image to guide prompt generation. Specifically, StyLIP uses CLIP’s image encoder to extract statistical information (mean and variance) from multi-level convolutional feature maps, forming both style features and multi-scale content features. The style features are processed by a set of Transformer encoders (style projector) to generate conditional embeddings, which are used to control the generation of a set of prompt tokens. Meanwhile, the multi-scale content features are processed by a content projector and then fused with the text features generated by the text encoder. The resulting fused text features are finally used for image classification.

Figure 4.3: Architecture of DDSPL. [56]

DDSPL employs disentangled prompt learning to separate information from different domains, as well as to disentangle domain-specific information from class-specific information.
Each source domain is associated with a domain-specific prompt and a corresponding domain concept textual feature. During inference, the image feature of the input is first compared with the domain concept textual features to compute similarity scores. These scores are then processed by a domain attribution module, which generates weights for each domain-specific text feature. Finally, the text features are fused through a weighted combination based on these computed weights.

SPG proposes a generative adversarial framework for prompt learning, aiming to generate domain-adaptive soft prompts. The method involves two training stages. In the first stage, prompt vectors that best represent the characteristics of each source domain are learned independently by training on each domain. In the second stage, a conditional generative adversarial network (GAN) is constructed, consisting of a generator and a discriminator. The generator takes image features and random noise as input and generates soft prompts. The discriminator aims to distinguish whether a generated prompt matches a real prompt. During inference, only the trained generator is used to generate personalized prompts based on the target image, which are then combined with the class label to perform classification.

4.3 Comparison Analysis

Next, we briefly analyze the main similarities and differences among the selected methods. It is important to note that in selecting state-of-the-art methods, we prioritized diversity, and as such, the differences among these approaches are significantly greater than their commonalities.

1. ProDA, BPL, and CoCoOp all model the semantic information of prompts from a probabilistic perspective. However, ProDA models the distribution of text features directly in the embedding space, whereas BPL and CoCoOp model the distribution of embedding vectors in the word embedding space.

2.
CoCoOp, DPL, StyLIP, and SPG all adopt dynamic prompt generation strategies, training lightweight networks that generate soft prompts on the fly, whereas DDSPL trains a set of soft prompts and performs prompt fusion during inference.

3. Among all the methods, only ProDA and DDSPL involve direct optimization of text features. Together with the results presented in Section 6.3, this suggests that directly optimizing text features is more beneficial for achieving modality alignment, thereby improving the generalization ability of the model.

5 Methods

Figure 5.1: Domain Prompt Fusion (DPF) architecture. During training, the text encoder and image encoder are frozen, and only the soft prompts (including the domain-agnostic part and domain-specific part) as well as the Domain Feature Extractor (DFE, including the source domain prototypes) are updated.

For the domain adaptation problem, since we already have information about the target domain and access to some (unlabeled) training data from it, we can design soft prompts specifically tailored to the target domain. This problem has been extensively studied [27], [58]–[61]. However, in the domain generalization scenario, we have no prior knowledge of the target domain. Therefore, it is necessary to design effective mechanisms that fully leverage the information available from the source domains.

Inspired by [56], [62], we propose a Domain Prompt Fusion (DPF) framework. It dynamically fuses soft-prompt text features from different source domains based on the domain feature extracted from the input image, in order to achieve better generalization.

The overall architecture of our method is shown in Figure 5.1. Our framework is built upon the CLIP model, which consists of a text encoder for extracting textual features, an image encoder for extracting visual features, and a classification mechanism based on cosine similarity. For a detailed explanation of these components, please refer to Section 3.1. On top of this foundation, our framework additionally includes the following three modules:

1. Soft Prompt, which is further divided into:
• Domain-agnostic prompt: captures domain-invariant or domain-independent information.
• Domain-specific prompt: captures domain-relevant information specific to each source domain.

2. Domain Feature Extractor (DFE): extracts domain features from the input image to represent its domain characteristics.

3. Prompt Fusion Mechanism: computes fusion weights based on the extracted domain features, enabling the dynamic combination of text features from different source domains.

Compared to prior works such as CoOp [7] and DDSPL [56], our approach introduces the following key innovations:

1. Adaptation of soft prompt decomposition to domain generalization. This design was originally proposed in the context of single-source domain adaptation [27]. While we adopt a similar conceptual structure, we modify the training procedure to make it suitable for multi-source domain generalization. For instance, we introduce an orthogonality loss on the domain-specific prompts to encourage disentanglement and improve domain generalization performance.

2. A domain-feature-based strategy for computing fusion weights. We observe that samples from different domains exhibit varying degrees of distributional shift in the embedding space. Based on this observation, we hypothesize that domain information embedded in the input images can be leveraged to guide text feature fusion. Accordingly, we propose a fusion weight computation strategy driven by domain features, which takes into account the similarity among domains. By contrast, although DDSPL adopts a similar fusion mechanism, it computes fusion weights based on the similarity between image features and domain-specific text features, without fully leveraging the domain information embedded in the visual modality of the source domains.
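The fusion idea outlined above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the exact formulation of this work: the use of cosine similarity to the source-domain prototypes followed by a temperature-scaled softmax, the temperature value 0.1, and all function names are hypothetical choices made for the sketch.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fusion_weights(domain_feature, prototypes, temperature=0.1):
    """One weight per source domain (assumed: softmax over cosine similarities)."""
    sims = [cosine(domain_feature, proto) for proto in prototypes]
    return softmax([s / temperature for s in sims])


def fuse_text_features(text_features_per_domain, weights):
    """Weighted sum over domains; text_features_per_domain[m][k] is the text
    feature of class k produced with the prompt of source domain m."""
    num_classes = len(text_features_per_domain[0])
    dim = len(text_features_per_domain[0][0])
    fused = [[0.0] * dim for _ in range(num_classes)]
    for weight, domain_feats in zip(weights, text_features_per_domain):
        for k, feat in enumerate(domain_feats):
            for d, value in enumerate(feat):
                fused[k][d] += weight * value
    return fused


# Hypothetical toy setup: two source domains, two classes, 2-D features.
prototypes = [[1.0, 0.0], [0.0, 1.0]]
text_feats = [
    [[1.0, 0.0], [0.0, 1.0]],  # class features under the domain-0 prompt
    [[0.0, 1.0], [1.0, 0.0]],  # class features under the domain-1 prompt
]
weights = fusion_weights([0.9, 0.1], prototypes)
fused = fuse_text_features(text_feats, weights)
print(weights, fused)
```

In this toy setup the input's domain feature lies close to the first prototype, so the first domain's text features dominate the fused result. The fused features would then replace the single-prompt text features in the cosine-similarity classifier of Equation 3.8.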
5.1 Soft Prompt Design

In our setting, each prompt is divided into three parts: domain-agnostic, domain-specific, and class label, as follows:

\[ p_k = \underbrace{[v]_1 [v]_2 \cdots [v]_{M_1}}_{\text{domain-agnostic tokens}} \; \underbrace{[d]_1 [d]_2 \cdots [d]_{M_2}}_{\text{domain-specific tokens}} \; [\mathrm{CLASS}]_k. \tag{5.1} \]

Here, $[v]$ represents the domain-agnostic tokens, $[d]$ represents the domain-specific tokens, and $[\mathrm{CLASS}]$ represents the class label. $M_1$ and $M_2$ are hyperparameters that control the lengths of the two types of prompts, and $k$ denotes the $k$-th class. All domains share the same domain-agnostic prompt, while an independent domain-specific prompt is trained for each domain.

The soft prompts are trained with the cross-entropy loss $\mathcal{L}_{ce}$ (Equation 3.11). Additionally, we introduce an orthogonality constraint to ensure that the prompts for different domains are as distinct as possible:

\[ \mathcal{L}_{orth} = \| G - \mathrm{diag}(G) \|_F^2 = \left( \sqrt{\sum_{i \neq j} (w_i \cdot w_j)^2} \right)^2, \tag{5.2} \]

where $w_i \in \mathbb{R}^{M_2 \cdot dim \times 1}$ is the domain-specific prompt vector, $dim$ is the word embedding dimension of each token (512 in CLIP), $G$ is the Gram matrix of the domain-specific prompt vectors with $G_{i,j} = w_i \cdot w_j$, and $\| \cdot \|_F$ is the Frobenius norm.

5.2 Domain Feature Extractor

Figure 5.2: Local architectural diagram of the domain feature extractor. The dashed lines indicate processes that occur only during training.

In our design, we aim to fully leverage the domain information embedded in the image to help determine which domain the input image belongs to, or to compute its similarity to different domains. To achieve this, we introduce the Domain Feature Extractor (DFE) module. The DFE is essentially a lightweight MLP network placed after the image encoder, as shown in Figure 5.2. It further extracts domain-related features from the image features produced by the encoder. The DFE consists of three fully connected layers and two dropout layers, and uses the ReLU activation function.
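As a rough illustration of the DFE described above, the following NumPy sketch runs a forward pass through a three-layer MLP with ReLU and dropout. The layer widths (512, 256, 256, 512) and the dropout rate of 0.2 are read from the experimental setup in Section 6.2, but the width of the second hidden layer, the placement of ReLU before dropout, the weight initialization, and the function names `init_dfe` and `dfe_forward` are assumptions made for this example and not the thesis implementation.

```python
import numpy as np

def init_dfe(rng, in_dim=512, hidden_dim=256, out_dim=512):
    """Randomly initialise three fully connected layers (weight, bias)."""
    dims = [in_dim, hidden_dim, hidden_dim, out_dim]  # assumed layer widths
    return [(rng.normal(0.0, 0.02, size=(d_in, d_out)), np.zeros(d_out))
            for d_in, d_out in zip(dims[:-1], dims[1:])]

def dfe_forward(params, image_features, rng=None, drop_rate=0.2):
    """Map image features (batch, in_dim) to unit-norm domain features.

    Dropout is applied after each hidden layer only when `rng` is given
    (training mode); pass rng=None for inference.
    """
    x = image_features
    for idx, (weight, bias) in enumerate(params):
        x = x @ weight + bias
        if idx < len(params) - 1:            # hidden layers only
            x = np.maximum(x, 0.0)           # ReLU (assumed to precede dropout)
            if rng is not None:              # inverted dropout, training only
                keep = rng.random(x.shape) >= drop_rate
                x = x * keep / (1.0 - drop_rate)
    # Project onto the unit hypersphere before any distance computation.
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, 1e-12)

rng = np.random.default_rng(0)
params = init_dfe(rng)
feats = rng.normal(size=(4, 512))                  # stand-in for frozen CLIP image features
domain_feats = dfe_forward(params, feats)          # inference mode, no dropout
train_feats = dfe_forward(params, feats, rng=rng)  # training mode, with dropout
```

The final normalization mirrors the projection onto the unit hypersphere that is applied to the domain features before the prototypical loss is computed.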
Through visualization analysis (see Section 6.8.1), we observe that image features from different domains are mixed together in the embedding space, making them difficult to distinguish. In order to differentiate images from different domains, we want the DFE to project image features from the embedding space into a new domain feature space, where features from the same domain are as close as possible and features from different domains are as far apart as possible. In addition, domain features should not exhibit pronounced class clustering; samples of different classes within the same domain should be uniformly mixed rather than forming separate clusters.

We use the prototypical loss [63] as the objective function to achieve this. The computation of the prototypical loss can be viewed as a form of hard clustering, which aligns well with our design objective. Other loss functions, such as contrastive loss or triplet loss, are based on the similarity or dissimilarity between sample pairs, but they do not simultaneously consider the intra-domain similarity across all samples within a domain and the inter-domain dissimilarity across all samples from different domains. Moreover, the computed prototypes can be directly used to determine the domain of an input sample. This approach is simple, efficient, and highly interpretable, as it eliminates the need to train an additional classifier to separate samples from different domains.

Let $f(\cdot)$ represent the image encoder and $h(\cdot)$ represent the DFE. First, we compute the prototype for each domain, which can be interpreted as the cluster center in k-means:

\[ p_m = \frac{1}{|S_m|} \sum_{x_i \in S_m} h(f(x_i)), \tag{5.3} \]

where $S_m$ is the sample set of domain $m$.
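The prototype computation in Equation 5.3, together with the distance-based softmax and negative log-likelihood that the prototypical loss builds on it, can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions: the prototypes are computed from the same batch that is scored (the thesis does not specify a support/query split), the toy features stand in for real DFE outputs, and the function names are hypothetical.

```python
import numpy as np

def domain_prototypes(domain_feats, domain_ids, num_domains):
    """Mean domain feature of each source domain (Equation 5.3)."""
    return np.stack([domain_feats[domain_ids == m].mean(axis=0)
                     for m in range(num_domains)])

def prototypical_loss(domain_feats, domain_ids, prototypes):
    """Mean negative log-probability of the labelled domain, using a softmax
    over negative Euclidean distances to the prototypes."""
    diffs = domain_feats[:, None, :] - prototypes[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)               # shape (N, M)
    logits = -dists
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(domain_ids)), domain_ids].mean()

# Toy data: three "domains", each clustered around its own direction.
rng = np.random.default_rng(0)
centers = np.eye(3, 8)
domain_ids = np.repeat(np.arange(3), 5)                  # 5 samples per domain
feats = centers[domain_ids] + 0.05 * rng.normal(size=(15, 8))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)    # unit hypersphere

prototypes = domain_prototypes(feats, domain_ids, num_domains=3)
loss_true = prototypical_loss(feats, domain_ids, prototypes)
loss_wrong = prototypical_loss(feats, (domain_ids + 1) % 3, prototypes)
```

With correct domain labels the loss is lower than with deliberately shifted labels, which is the behaviour the training objective relies on.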
Then, the probability of a sample belonging to domain $m$ can be computed using the following formula:

\[ p(y = m \mid x_i) = \frac{\exp\big(-d(h(f(x_i)), p_m)\big)}{\sum_{j=1}^{M} \exp\big(-d(h(f(x_i)), p_j)\big)}, \tag{5.4} \]

where $d(\cdot)$ denotes the Euclidean distance in the domain feature space,

\[ d\big(h(f(x_i)), p_m\big) = \| h(f(x_i)) - p_m \|_2. \tag{5.5} \]

The use of Euclidean distance in this context follows the design choices made in the original paper [63]. However, in accordance with recommendations from [51], we apply normalization to the domain features $h(f(x_i))$ prior to computing the prototypical loss, projecting them onto a unit hypersphere. Under this condition, the difference between Euclidean distance and cosine similarity becomes negligible, since for unit-norm vectors $\|x - y\|_2^2 = 2 - 2\, x \cdot y = 2 - 2\cos(x, y)$.

The total loss function is defined as follows:

\[ \mathcal{L}_{proto} = -\frac{1}{N} \sum_{i=1}^{N} \log p(y = m \mid x_i), \tag{5.6} \]

where $m$ is the ground-truth domain of sample $x_i$.

After training, the DFE saves the prototypes of each source domain computed in the final iteration. During inference, the DFE projects the target domain image features into the domain feature space and then calculates their cosine similarity to each of the stored source domain prototypes.

5.3 Prompt Fusion Mechanism

The Prompt Fusion Mechanism is the core of our proposed method. Its computed weights directly determine the effectiveness of the fused prompt and ultimately affect the classification accuracy. In our design, we use the text features from different domains as a set of basis vectors, and then compute a linear combination of these basis vectors to obtain the text feature for the target domain:

\[ \tilde{f}_i = \sum_{m} \alpha_m f_i^m. \tag{5.7} \]

We fuse the text features of different domains for each class separately. Here, $f_i^m$ denotes the text feature of class $i$ from domain $m$, $\tilde{f}_i$ denotes the fused text feature of class $i$, and $\alpha_m$ is the weight of the text feature from domain $m$. The fusion weights are computed based on the cosine similarities obtained from the DFE.
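A minimal NumPy sketch of this fusion step is given here, assuming that the weights are a temperature-scaled softmax over the cosine similarities between the input's domain feature and the stored source prototypes (the form stated in Equation 5.8), and that the fused text features are normalized only after fusion. The array shapes, the toy random inputs, and the function names are assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale vectors to unit length along `axis`."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), 1e-12)

def fusion_weights(domain_feat, prototypes, temperature=0.8):
    """Softmax over cosine similarities between one domain feature (D,)
    and the M source-domain prototypes (M, D)."""
    sims = l2_normalize(prototypes) @ l2_normalize(domain_feat)  # (M,)
    logits = sims / temperature
    logits = logits - logits.max()                               # stability
    weights = np.exp(logits)
    return weights / weights.sum()

def fuse_and_classify(image_feat, text_feats, weights):
    """text_feats has shape (M, K, D): class k under the prompt of domain m.
    Returns the predicted class index and the fused text features (K, D)."""
    fused = np.einsum("m,mkd->kd", weights, text_feats)          # Equation 5.7
    scores = l2_normalize(fused) @ l2_normalize(image_feat)      # cosine scores
    return int(np.argmax(scores)), fused

rng = np.random.default_rng(0)
prototypes = rng.normal(size=(3, 16))        # 3 source domains (toy values)
text_feats = rng.normal(size=(3, 5, 16))     # 5 classes (toy values)
domain_feat = prototypes[1].copy()           # input that resembles domain 1
image_feat = rng.normal(size=16)

weights = fusion_weights(domain_feat, prototypes)
pred, fused = fuse_and_classify(image_feat, text_feats, weights)
```

In this toy setup the largest weight falls on domain 1, because the input's domain feature coincides with that prototype. Lowering `temperature` concentrates the weights on the most similar domain, while raising it moves them toward a uniform average.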
Currently, we use a relatively simple calculation method: we compute the probability of the target domain feature belonging to each source domain using the cosine similarity, and directly use these probabilities as the fusion weights:

\[ \alpha_m = \frac{\exp\big(\cos(h(f(x_i)), p_m)/\tau\big)}{\sum_{j=1}^{M} \exp\big(\cos(h(f(x_i)), p_j)/\tau\big)}, \tag{5.8} \]

where $\tau$ is the fusion temperature.

After obtaining the fused text feature for each class, we compute the cosine similarity between the image feature and these fused text features to perform image classification.

5.4 Training Strategy

During training, we keep the original parameters of the CLIP model (both the text encoder and image encoder) frozen, and only train the soft prompts and the Domain Feature Extractor (DFE).
• When training the domain-agnostic prompt, we use only the cross-entropy loss ($\mathcal{L}_{ce}$).
• When training the domain-specific prompts, we apply both the cross-entropy loss and the orthogonality loss ($\mathcal{L}_{ce} + \mathcal{L}_{orth}$) to encourage diversity between prompts.
• For training the DFE, we use the prototypical loss ($\mathcal{L}_{proto}$).

Both the soft prompts and the DFE are trained on a dataset that includes all data from all source domains. This means we train them jointly across all source domains, rather than sequentially training on each source domain separately.

6 Results

In this chapter, we introduce the datasets used, the baselines for comparison, implementation details, and the results of various tests. We also provide a detailed analysis of these results.

6.1 Dataset and Baseline

Our main goal is to evaluate the model's domain generalization ability, so we primarily conduct experiments on the Office-Home dataset. Office-Home is a domain generalization dataset, which includes four domains (Art, Clipart, Product, and Real-World) with 65 categories collected from everyday objects, totaling over 15,500 images.
We also trained and evaluated our method on mini-DomainNet [33], a benchmark dataset specifically designed for domain generalization, which is constructed by sampling a subset of images from the larger DomainNet [41] dataset. It comprises 140,006 images divided into four domains: Real, Painting, Sketch, and Clipart.

We use the leave-one-domain-out evaluation protocol [33], meaning that in each experiment, we leave one domain as the test set while using the other domains as the training set. The evaluation metric is accuracy.

Figure 6.1: Example figures from Office-Home and mini-DomainNet. Panels: (a) Office-Home, (b) Mini-DomainNet.

We compare our method with zero-shot CLIP and prompt-learning or adapter-based approaches. Zero-shot CLIP refers to directly using the pre-trained CLIP model for image classification on the target domain, without any additional training or fine-tuning. We consistently use the prompt template “a photo of a [CLASS].” as the input to the text encoder.

We also select CoOp [7], CoCoOp [26], ProDA [52], CLIP-Adapter [31], DPL [54], BPL [53], StyLIP [55], DDSPL [56], and SPG [57] as our baselines. We retrained and evaluated only zero-shot CLIP and CoOp ourselves; results for the remaining methods were taken from the literature [55]–[57] without further validation.

6.2 Experimental Setup

We implement our code based on the CoOp framework and the Dassl library [14], [33]. For the pre-trained CLIP model, we use ViT-B/16 [25] as the image encoder. Both the domain-agnostic prompt and the domain-specific prompt are set to a length of 8 tokens. The soft prompts are randomly initialized by sampling from a Gaussian distribution with a mean of 0 and a standard deviation of 0.02, consistent with the setup in CoOp. The coefficient of the orthogonal loss for the domain-specific prompt is 10.0.
The Domain Feature Extractor (DFE) consists of three randomly initialized fully connected layers: the input dimension matches the feature dimension of the embedding space (512), the hidden layer dimension is 256, and the output layer dimension is 512. We place a dropout layer with a dropout rate of 0.2 between every two fully connected layers.

Except for the number of training epochs, the soft prompts and the DFE share the same optimizer settings. We use the Adam [64] optimizer. The initial learning rate is set to 0.002, and we apply a cosine annealing learning rate scheduler to gradually decrease the learning rate during training. The batch size is 32 on the Office-Home dataset and 256 on mini-DomainNet. The soft prompts are trained for 10 epochs, and the DFE is trained for 200 epochs. The fusion temperature is set to 0.8.

As a baseline, CoOp is trained on all source domains. The prompt length is set to 16 tokens, and all other hyperparameters are kept consistent with those used when training our proposed method.

All training and testing procedures were conducted on the Alvis server cluster provided by NAISS. We primarily utilized compute nodes equipped with an NVIDIA Tesla A40 GPU and an Intel(R) Xeon(R) Gold 6338 CPU @ 2GHz.

6.3 Comparison with Baselines

6.3.1 Evaluations on Office-Home

Table 6.1 presents the testing results of our method compared to the baselines on the Office-Home dataset. Our method outperforms zero-shot CLIP by 2.95% and the classic prompt-learning approach CoOp by 1.47%, demonstrating its effectiveness in enhancing CLIP's robustness to domain shift. Furthermore, our approach achieves better performance than most baselines. However, we acknowledge that our method trails the state-of-the-art DDSPL by 0.61% and ranks third overall, just 0.07% behind ProDA, indicating room for further improvement.

On each domain, compared to zero-shot CLIP, our method achieves the largest gain on Clipart (+4.4%) and the smallest gain on Real World (+1.7%).
Relative to the best-performing methods, our largest deficit occurs on Art, where we are 2.13% below StyLIP, and the smallest deficit is on Clipart, where we are 0.43% below DDSPL. Based on the subsequent visualization analysis (Section 6.8.1), we hypothesize that our method's poorer performance on the Art domain may be due to the fact that, for certain classes, the Art-domain image features differ only marginally from those of other domains, while for other classes the differences are more pronounced. This uneven discrepancy results in fused text features that fail to achieve consistent modality alignment with image features across all classes.

| Methods | Art | Real world | Clipart | Product | Average |
|---|---|---|---|---|---|
| CoCoOp | 79.60 | 86.32 | 69.35 | 87.51 | 80.70 |
| Zero-shot CLIP | 80.50 | 89.10 | 70.20 | 88.30 | 82.03 |
| CLIP-Adapter | 82.76 | 88.02 | 70.08 | 88.04 | 82.23 |
| CoOp | 80.70 | 90.13 | 72.47 | 90.77 | 83.51 |
| SPG | 81.60 | 89.90 | 72.70 | 90.20 | 83.60 |
| BPL | 83.02 | 90.83 | 72.01 | 90.21 | 84.02 |
| DPL | 82.50 | 91.50 | 71.70 | 91.20 | 84.23 |
| StyLIP | 84.93 | 90.64 | 72.61 | 90.35 | 84.63 |
| DPF (Ours) | 82.80 | 90.80 | 74.60 | 91.70 | 84.98 |
| ProDA | 83.44 | 91.13 | 73.79 | 91.84 | 85.05 |
| DDSPL | 83.26 | 91.51 | 75.03 | 92.54 | 85.59 |

Table 6.1: Comparison of accuracy across different methods on the Office-Home dataset. The name of each column represents the target domain used during testing, while the other three domains serve as the source domains for training in that setting. Results are sorted in ascending order of average accuracy.

| Methods | Painting | Real | Clipart | Sketch | Average |
|---|---|---|---|---|---|
| Zero-shot CLIP | 82.50 | 91.60 | 82.70 | 79.60 | 84.10 |
| CoOp | 83.63 | 89.73 | 84.87 | 79.43 | 84.42 |
| BPL | 83.01 | 92.21 | 85.03 | 80.85 | 85.28 |
| DPF (Ours) | 84.60 | 91.60 | 84.50 | 81.40 | 85.53 |
| CoCoOp | 83.78 | 91.60 | 86.50 | 81.34 | 85.81 |
| ProDA | 84.39 | 92.20 | 86.23 | 81.29 | 86.03 |
| DDSPL | 84.58 | 92.37 | 86.59 | 81.37 | 86.23 |

Table 6.2: Comparison of accuracy across different methods on the mini-DomainNet dataset. The name of each column represents the target domain used during testing, while the other three domains serve as the source domains for training in that setting.
Results are sorted in ascending order of average accuracy.

6.3.2 Evaluations on Mini-DomainNet

Table 6.2 presents the testing results of our method compared to the baselines on the mini-DomainNet dataset. On mini-DomainNet, our method still outperforms zero-shot CLIP and CoOp by 1.43% and 1.11%, respectively. However, it trails the state-of-the-art DDSPL by 0.7% and underperforms ProDA (by 0.5%) and CoCoOp (by 0.28%), ranking fourth among all methods. This may be because we applied the exact same hyperparameters used for Office-Home without any adjustment; further hyperparameter tuning could potentially improve our results.

Analysis by domain shows that our method achieves the best performance among all methods on Painting and Sketch. However, on Real, our method performs on par with zero-shot CLIP but is 0.77% behind DDSPL; on Clipart, although our method outperforms zero-shot CLIP by 1.8%, it falls 0.37% short of CoOp and 2.09% short of DDSPL. Due to time constraints, we did not visualize the image feature distributions on mini-DomainNet, but we suspect the causes are similar to those observed on Office-Home.

6.4 Ablation Study

| Setting | Art | Real world | Clipart | Product | Average |
|---|---|---|---|---|---|
| DAP only | 82.5 | 90.7 | 73.4 | 91.6 | 84.55 |
| DSP only | 83.2 | 90.3 | 73.2 | 91.3 | 84.50 |
| remove DSP’s Lorth | 83.2 | 90.3 | 74.1 | 91.1 | 84.68 |
| remove DFE | 82.2 | 90.8 | 74.4 | 91.0 | 84.60 |
| greedy fusion | 82.3 | 90.4 | 73.6 | 90.3 | 84.15 |
| average fusion | 82.2 | 90.8 | 74.6 | 91.1 | 84.68 |
| DPF (Ours) | 82.8 | 90.8 | 74.6 | 91.7 | 84.98 |

Table 6.3: Ablation study results on Office-Home dataset. “DAP only” refers to using only the domain-agnostic prompt, with all other modules unchanged. “DSP only” refers to using only the domain-specific prompt, with all other modules unchanged. “Remove DFE” means no longer using the domain feature extractor to guide fusion; instead, the raw image feature is used directly. “Greedy fusion” refers to replacing weighted fusion with directly using the text feature from the domain with the highest similarity.
“Average fusion” refers to omitting similarity calculations and directly using the mean of the text features from all source domains.

The ablation study results are presented in Table 6.3. In this experiment, we remove or replace individual modules in our framework and observe the impact on overall performance. As shown, the full configuration achieves the best average accuracy, supporting the validity of our design; eliminating or altering any component lowers the average. For example, removing the domain feature extractor and using the raw image feature to guide fusion results in a 0.38% drop in accuracy, indicating that the image feature alone does not sufficiently distinguish samples from different domains. Similarly, replacing weighted fusion with greedy fusion yields an accuracy decrease of 0.83%, demonstrating that even the most similar source domain's text feature cannot reliably achieve modality alignment with the target-domain image feature and that weighted fusion is necessary.

However, domain-specific observations reveal unexpected findings. On the Art domain, using only the domain-specific prompt outperforms the combination of domain-agnostic and domain-specific prompts, suggesting that the domain-invariant information learned from the source domains may not transfer well to the target domain and can mislead the model. Additionally, on the Real World and Clipart domains, average fusion produces results equivalent to our weighted fusion method, suggesting that the features in these domains are effectively the mean mixture of the other three source domains.

6.5 Analysis of Changing Soft Prompt Length

| Prompt length | Art | Real world | Clipart | Product | Average |
|---|---|---|---|---|---|
| 4+4 | 82.7 | 90.5 | 73.1 | 91.2 | 84.38 |
| 8+8 | 82.8 | 90.8 | 74.6 | 91.7 | 84.98 |
| 16+16 | 81.6 | 90.7 | 74.1 | 92.0 | 84.60 |

Table 6.4: Comparison results on Office-Home dataset. The prompt length is given in the order of “domain-agnostic prompt + domain-specific prompt”.
We experimented with changing the length of the soft prompts and conducted repeated tests on the target domain, as shown in Table 6.4. We evaluated three length configurations. The results show that the medium-length soft prompt (8+8) exhibits the strongest generalization capability, followed by the longest (16+16), while the shortest (4+4) performs worst, at 0.6% below the medium setting. This outcome may be due to the fact that shorter soft prompts cannot capture sufficient informative cues to guide the model's classification, whereas longer soft prompts may encode excessive source-domain-specific information, leading to some overfitting and reduced generalization.

6.6 Analysis of Domain Feature Extractor

6.6.1 Change the Dimension of the Domain Feature

The results of varying the output dimensionality of the domain feature extractor (the dimension of the domain feature) are shown in Table 6.5. Due to time constraints, we tested on only one target domain; we expect the findings to carry over to the other domains, but did not verify this. Note that on the training set, we assess the domain feature extractor by its domain-classification accuracy (i.e., its ability to correctly identify the domain of each sample), rather than by class-classification accuracy. This is because an ideal domain feature should be class-agnostic, meaning that samples from the same domain, regardless of class, should yield similar domain features. On the test set (i.e., the target domain), we evaluate using class-classification accuracy, since our focus there is the domain-feature-guided text-feature fusion capability, and because the target-domain samples cannot be mapped onto any of the source domains for domain classification.

| Dimension | Source domain accuracy | Target domain accuracy |
|---|---|---|
| 512 | 89.90 | 74.60 |
| 256 | 89.68 | 74.50 |
| 128 | 89.61 | 74.70 |
| 64 | 90.10 | 74.30 |

Table 6.5: Training and testing accuracies for different domain feature dimensions.
Results were obtained on the Office-Home dataset with source domains Real World, Product, and Art, and target domain Clipart. “Source domain accuracy” refers to the accuracy on the domain classification task over the source domains, and “Target domain accuracy” refers to the accuracy on the image classification task in the target domain.

We observe that when the domain feature dimension is between 128 and 512, the training and testing accuracies are roughly equivalent, with only minor differences. However, when the domain feature dimension is lower, such as 64, the source-domain accuracy remains high (90.10%) while the target-domain accuracy drops to 74.30%, the lowest of the four settings. We speculate that at lower dimensions, the feature space is insufficient to fully capture the domain characteristics of the images, leading to overfitting on the source domains and a consequent reduction in generalization ability.

6.6.2 Change Design

| Setting | Art | Real world | Clipart | Product | Average |
|---|---|---|---|---|---|
| With dropout | 82.8 | 90.8 | 74.6 | 91.7 | 84.98 |
| Without dropout | 82.4 | 90.6 | 74.3 | 91.0 | 84.58 |
| Statistics (all layers) | 82.4 | 90.9 | 74.6 | 91.3 | 84.80 |
| Statistics (last 4 layers) | 82.6 | 90.8 | 74.5 | 91.6 | 84.88 |
| Statistics (last layer) | 82.4 | 90.9 | 74.6 | 91.4 | 84.83 |

Table 6.6: Results of adjusting the design of domain feature extractor on Office-Home dataset.

We tested alternative schemes for extracting domain features to validate the architectural soundness of our domain feature extractor. These include removing the dropout layers from the network and, following methods similar to [65], [66], extracting statistics from intermediate layers of the image encoder to construct domain features, rather than relying solely on the features after the image encoder. The “without dropout” group was trained for 2,000 epochs. For the three “statistics” groups, the network was a simple linear layer trained for 10 epochs with a fusion temperature of 3.0. All other hyperparameter settings remained unchanged.

Here we briefly introduce the statistics method.
For each layer’s output from the image encoder, the mean and standard deviation are extracted. For a ViT-B/16, the output of each layer has dimensions (batch_size, n_tokens, dim), where n_tokens = 196 + 1 (196 patch tokens for a 224 × 224 input, plus [CLS]) and dim = 768. For all tokens except the [CLS] token, the mean and standard deviation are computed along each dimension in the “dim” axis, yielding a vector of size 2 × 768, denoted as $[\mu_l, \sigma_l]$. These vectors from the selected $L$ layers are then stacked to form a vector $[\mu_1, \sigma_1, \cdots, \mu_L, \sigma_L]$ with dimensions $L \times 2 \times 768$, which is subsequently passed through a linear layer to compress it to 512 dimensions.

We observe that adding dropout layers does improve the generalization capability of the domain feature extractor to some extent, enabling it to extract more effective domain features (in terms of guiding text-feature fusion). All of the statistics-based methods outperform the DFE without dropout, but still underperform the DFE with dropout. This may be because those methods were originally designed for CNNs and are not particularly suited to vision transformers.

An interesting finding is that performance varies depending on which layers' statistics are used. For example, using the last four layers yields the best result among the statistics-based variants, whereas using all layers performs worse. This likely relates to the differing distributions of statistics across layers. Theoretically, in vision transformers, higher layers extract more abstract, global image features, while lower layers focus on local, detailed features. Therefore, for our purposes, features from higher layers are more helpful for distinguishing samples from different domains.

6.7 Analysis of Fusion Mechanism

| Temperature | Art | Real world | Clipart | Product | Average |
|---|---|---|---|---|---|
| 0.1 | 82.4 | 90.5 | 73.9 | 90.3 | 84.28 |
| 0.5 | 82.6 | 90.8 | 74.5 | 90.8 | 84.68 |
| 0.8 | 82.8 | 90.8 | 74.6 | 91.7 | 84.98 |
| 1.0 | 82.6 | 90.8 | 74.5 | 91.2 | 84.78 |

Table 6.7: Results of adjusting fusion temperature on Office-Home dataset.
The results of adjusting the fusion temperature on the Office-Home dataset are shown in Table 6.7. Lower temperatures produce a “sharper” weight distribution, i.e., closer to greedy fusion, whereas higher temperatures yield a “flatter” distribution, i.e., closer to average fusion. The results indicate that a temperature of 0.8 achieves the best fusion performance among the tested values; raising or lowering the temperature from this value degrades the results. This suggests that a moderately sharpened weight distribution better aligns the fused text features with the image features.

6.8 Visualization

In this section, we visualize the image features and domain features of samples from the Office-Home dataset using t-SNE [67] to gain a more intuitive understanding of domain shift and the effectiveness of the domain feature extractor. t-SNE is a classic nonlinear dimensionality reduction technique. The basic idea is that it measures pairwise similarity in the high-dimensional space using Gaussian distributions, computes pairwise similarity in the low-dimensional space using t-distributions, and minimizes the KL divergence between the two to enable visualization of high-dimensional data in a low-dimensional space. We used the TSNE class provided by the scikit-learn library, with all parameters kept at their default values.

It should be noted that in t-SNE visualizations, the distances between clusters have no intrinsic meaning. Greater separation does not necessarily indicate larger distributional differences. However, cluster overlap can give a rough indication of proximity in the original feature space: clearly separated clusters suggest substantial distributional divergence, whereas overlapping clusters imply greater similarity.

For comparative analysis, we also repeated all experiments in this section using the UMAP [68] method. Like t-SNE, UMAP is a nonlinear dimensionality reduction technique.
However, UMAP leverages principles from topology and manifold learning to project high-dimensional vectors into a lower-dimensional space, offering a better balance between preserving local and global structures compared to t-SNE. The overall trends observed in the visualizations of both image features and domain features are consistent across the two dimensionality reduction techniques. The UMAP visualization results can be found in Appendix A.2.

6.8.1 Visualization of Domain Shift

Figure 6.2: Distribution of image features of the same class across different domains. Three classes were randomly selected from all classes for visualization. Panels: (a) Alarm Clock, (b) Chair, (c) Computer.

Figure 6.2 illustrates the distribution of image features for the same class across different domains. From the visualization, we can draw the following conclusions. First, there is indeed a certain degree of domain shift in CLIP's embedding space, manifested as distributional differences among samples from different domains. For example, in the “Chair” category, the four domains form clearly separated clusters with distinct boundaries. Second, the extent of domain shift varies across categories. Although the “Chair” category shows pronounced distributional differences, in the “Alarm Clock” category, Real World overlaps considerably with Art and Product; in the “Computer” category, Real World also overlaps significantly with Art. This implies that the degree of domain shift differs by category, which can affect the final classification performance.

By contrast, Figure 6.3 shows the distribution of image features for different classes within the same domain. Comparing both sets of visuals, we see that inter-class distribution differences are significantly greater than inter-domain differences, as there is almost no overlap between samples of different classes.
However, in light of the results from Section 6.3.1, we observe a clear relationship between class distribution and classification accuracy. For example, in the Product and Real World domains, class boundaries are sharp, resulting in higher accuracy; in Clipart, however, classes exhibit some mixing, leading to a noticeable drop in accuracy.

This inter-domain variation in class-wise sample distributions can also be regarded as a form of domain shift. Theoretically, an ideal feature extractor for classification should be immune to domain-specific information and consistently extract class-relevant features, such that class distributions remain similar across domains. These are known as domain-invariant features. In our framework, we do not modify image features to counteract this type of domain shift, because altering image features without fine-tuning the backbone is challenging and prone to overfitting on small datasets, leading to catastrophic forgetting [3]. Given our time constraints, we defer investigation of this issue.

Figure 6.3: Distribution of image features of different classes within the same domain. Ten classes were randomly selected for visualization. Panels: (a) Art, (b) Clipart, (c) Product, (d) Real world.

6.8.2 Visualization of Domain Features

Figure 6.4: Distribution of domain features for the “Alarm Clock” category after extraction by the domain feature extractor. Each domain name indicates that it was treated as the target domain, with the DFE trained on the other three source domains. Panels: (a) Art, (b) Clipart, (c) Product, (d) Real world.

Figure 6.4 shows the distribution of domain features extracted by the DFE. Due to space limitations, we only present the results for the “Alarm Clock” category; other categories exhibit similar patterns (see Appendix A.1). We can see that after training, the DFE is indeed capable of distinguishing samples from different source domains, particularly separating Clipart from the other source domains.
It is also noteworthy that, except when Real World is treated as the target domain, the samples from Real World, Art, and Product do cluster, but their boundaries are not sharply defined. In the next section's analysis, we will show that this blurred separation actually aligns with our requirements.

Another very interesting observation concerns the distribution of target-domain samples. During training, we have no access to any target-domain training samples, so we cannot make any assumptions or impose constraints on their distribution. However, the results show that the DFE can extract domain features very effectively. Figure 6.2a illustrates the actual distribution of image features for the “Alarm Clock” class across the four domains. Despite never seeing any target-domain samples during training, the DFE correctly projects them onto similar source domains: Art overlaps with Real World, Clipart overlaps with both Art and Real World, and Product overlaps with Real World. This effectiveness of the domain features is precisely what ensures the success of the subsequent fusion process.

Figure 6.5: Distribution of domain features for different classes. The DFE was trained on source domains Clipart, Real World, and Product. Ten classes were randomly selected for visualization. Panels: (a) Art, (b) Clipart, (c) Product, (d) Real world.

To demonstrate that the DFE extracts class-agnostic domain features, we visualized the domain features of samples from different classes, as shown in Figure 6.5. Due to space constraints, we only present the domain features extracted by one DFE; other DFEs show similar results. The figure shows that samples of different classes are thoroughly mixed in the domain feature space, with no discernible class separation. This indicates that the DFE indeed focuses on domain-related features while discarding most class-related information, further supporting the effectiveness of our DFE design and training.

6.8.3 “Ideal” Domain Feature Extractor

Based on the analysis in the previous section and the results of the ablation study in Section 6.4, we believe that the rationality and effectiveness of the domain feature extractor design have been sufficiently demonstrated. However, some potential concerns may still arise: by projecting the domain features of the target domain onto one or more source domains, is there a risk that the model might mistakenly classify them as belonging to those source domains? Would this truly lead to optimal performance? If it were possible to train an “ideal” DFE that ensures the target domain features remain largely independent from the distributions of all source domains, could the model then recognize that the target samples originate from a novel domain, thereby yielding improved results?

Although it is not entirely impossible for the DFE to extract domain features from the target domain that are independent of those from the source domains, achieving this is highly challenging, primarily because no target domain training data or even prior information is available during the training phase. However, from a purely theoretical perspective, we can adopt a convenient approximation: directly including target domain samples in the training set. This approach explicitly guides the DFE to learn how to distinguish between target and source domain samples, thereby resulting in what could be considered an “ideal” DFE.

Figure 6.6: Distribution of domain features for the “Alarm Clock” class extracted by the “ideal” domain feature extractor.

| Setting | Art | Real world | Clipart | Product | Average |
|---|---|---|---|---|---|
| “Ideal” DFE | 82.2 | 90.7 | 74.6 | 91.2 | 84.68 |
| “Normal” DFE | 82.8 | 90.8 | 74.6 | 91.7 | 84.98 |

Table 6.8: Comparison between domain features extracted using the “ideal” domain feature extractor and those extracted by a DFE trained normally on source-domain samples.

Experimental results show that the domain classification accuracy of the “ideal” DFE reaches 90.38%. A visualization of the domain features extracted by the “ideal” DFE for the “Alarm Clock” category is presented in Figure 6.6. As can be seen, the model is indeed capable of distinguishing samples from different domains. Although the boundaries among the Real World, Product, and Art domains are somewhat ambiguous, samples from each domain still tend to form distinct clusters.

We then replaced the “normal” DFE (trained on the source domains) with the “ideal” DFE to guide the fusion of text features. The results, as shown in Table 6.8, reveal that despite the “ideal” DFE’s ability to separate samples by domain, it performs worse in guiding text feature fusion compared to the “normal” DFE. This suggests that, from the perspective of fusion guidance, explicitly separating target domain samples from those of the source domains provides no benefit.

If we compare this result with the “average fusion” baseline in Table 6.3, we can draw an interesting conclusion: the performance of the “ideal” DFE is almost identical to that of average fusion. This suggests that when different domains are completely separated, the model tends to treat the target domain as dissimilar to all source domains, resulting in nearly equal fusion weights across them. In this case, enforcing strict domain separation undermines the ability of the domain features to reflect inter-domain similari