Controlling Diffusion Input-Output Mapping of the Components of a Diffusion Model as a Potential Approach for Enhanced Model Control Master’s thesis in Complex adaptive systems Philip Gard DEPARTMENT OF PHYSICS CHALMERS UNIVERSITY OF TECHNOLOGY Gothenburg, Sweden 2023 www.chalmers.se www.chalmers.se Master’s thesis 2023 Controlling Diffusion Input-Output Mapping of the Components of a Diffusion Model as a Potential Approach for Enhanced Model Control Philip Gard Department of Physics Chalmers University of Technology Gothenburg, Sweden 2023 Controlling Diffusion Input-Output Mapping of the Components of a Diffusion Model as a Potential Ap- proach for Enhanced Model Control Philip Gard © Philip Gard, 2023. Supervisors: Mats Granath, Physics and Hampus Linander, Mathematical Sciences Examiner: Mats Granath, Physics Master’s Thesis 2023 Department of Physics Division of Physics Chalmers University of Technology SE-412 96 Gothenburg Telephone +46 31 772 1000 Cover: Visualization showing the effect of perturbations on the prompt "a pho- tograph of an astronaut riding a horse", the perturbation size was +15% in this image. Typeset in LATEX, template by Kyriaki Antoniadou-Plytaria Printed by Chalmers Reproservice Gothenburg, Sweden 2023 iv Controlling Diffusion Input-Output Mapping of the Components of a Diffusion Model as a Potential Ap- proach for Enhanced Model Control Philip Gard Department of Physics Chalmers University of Technology Abstract In this thesis, the primary focus is the exploration and evaluation of input-output mappings within the components of diffusion models, extending conventional meth- ods of control like prompt engineering. The objective is not to propose a definitive solution for control within diffusion models, but rather to probe the underlying mappings that drive these processes. The project examines two main components: The attention maps within the CLIP model, and the input-output relationships in the diffusion process itself. By doing so, the intention is to increase understanding of the inherent complexity of the models and identify potential control opportunities. In the intricate domain of diffusion models, the evaluation of input-output maps might not always offer a clear-cut measure of success. This project contributes by examining the interplay between various components of the model, as viewed through the lens of input-output mappings. Instead of presenting conclusive control methods, it examines these components, setting a possible path for further explo- ration. Keywords: diffusion model, stable diffusion, CLIP, input-output mapping, attention map, saliency map. v Acknowledgements I want to thank Mats Granath and Hampus Linander at Chalmers University of Technology for the instrumental guidance and support through this project. Compu- tations were enabled by resources provided by the National Academic Infrastructure for Supercomputing in Sweden (NAISS) and the Swedish National Infrastructure for Computing (SNIC) at Chalmers Centre for Computational Science and Engineering (C3SE), partially funded by the Swedish Research Council through grant agree- ments no. 2022-06725 and no. 2018-05973. I would also like to acknowledge that this project would not have been possible without the open-source model Stable Diffusion by Runway, CompVis, and Stability AI, made available through Hugging Face. 
Philip Gard, Gothenburg, June 2023 vii List of Acronyms Below is the list of acronyms that have been used throughout this thesis listed in alphabetical order: BPE Byte Pair Encoding C3SE Chalmers Centre for Computational Science and Engineering CNN Convolutional Neural Network DDIM Denoising Diffusion Implicit Model DDPM Denoising Diffusion Probabilistic Model GAN Generative Adversarial Networks GLIDE Text-conditional image generation LDM Latent Diffusion Models LLM Large Language Model LMS Linear Multistep Schedulers ML Machine learning MSE Mean Squared Error NAISS National Academic Infrastructure for Supercomputing in Sweden NLP Natural Language Processing PE Positional Encoding PNDM Pseudo Numerical Diffusion Methods SDE Stochastic differential equation SNIC Swedish National Infrastructure for Computing VAE Variational Autoencoder ViT The Vision Transformer ix Contents List of Acronyms ix List of Figures xiii List of Tables xvii 1 Introduction 1 1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Aim . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.3 Problem description . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.3.1 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.3.2 Specification of issues under investigation . . . . . . . . . . . . 2 1.4 Ethical Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.4.1 Misuse Risk . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.4.2 Impact on Artistic Professionals . . . . . . . . . . . . . . . . . 2 2 Theory 3 2.1 Overview and Context . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2.1.1 Stable Diffusion architectural overview . . . . . . . . . . . . . 3 2.2 Theoretical framework . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2.2.1 Autoencoders . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2.3 Converting prompts into Text embeddings . . . . . . . . . . . . . . . 4 2.3.1 Tokenization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2.3.2 Word Embedding . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.3.3 Positional Encoding . . . . . . . . . . . . . . . . . . . . . . . . 5 2.3.4 Attention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.3.5 Multi-Head Attention . . . . . . . . . . . . . . . . . . . . . . . 8 2.3.6 Attention Maps . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.3.7 Transformer . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.3.8 Vision Transformer . . . . . . . . . . . . . . . . . . . . . . . . 12 2.3.9 CLIP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.4 Diffusion Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.4.1 Base Diffusion Model . . . . . . . . . . . . . . . . . . . . . . . 14 2.4.2 Schedulers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.4.2.1 Denoising Diffusion Implicit Models . . . . . . . . . 15 2.4.2.2 Linear Multistep Schedulers . . . . . . . . . . . . . . 16 2.4.2.3 Pseudo Numerical Methods . . . . . . . . . . . . . . 16 xi Contents 2.4.3 GLIDE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.5 Efficient image representations . . . . . . . . . . . . . . . . . . . . . . 17 2.5.1 VAE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.5.2 Latent Diffusion Models . . . . . . . . . . . . . . . . . . . . . 18 2.6 U-Net . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.7 The Stable Diffusion Model . . . . . . . 
. . . . . . . . . . . . . . . . 19 3 Methods 21 3.1 Software environment . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 3.2 Creation of an Object-Oriented Pipeline . . . . . . . . . . . . . . . . 21 3.3 Input-Output Mapping of the Diffusion Model modules . . . . . . . . 22 3.3.1 Word Attention . . . . . . . . . . . . . . . . . . . . . . . . . . 22 3.3.2 Smoothly adjusting the prompt . . . . . . . . . . . . . . . . . 23 3.3.2.1 Adjustments for control . . . . . . . . . . . . . . . . 23 3.3.2.2 Comparing schedulers . . . . . . . . . . . . . . . . . 24 3.3.3 Prompt-to-Image Mapping . . . . . . . . . . . . . . . . . . . . 24 3.3.3.1 Perturbation Mapping . . . . . . . . . . . . . . . . . 24 3.3.3.2 Gradient Mapping through Finite difference . . . . . 25 3.3.4 Noise-to-Image Mapping . . . . . . . . . . . . . . . . . . . . . 25 3.4 Proposed method of control: Movement-to-Image . . . . . . . . . . . 26 4 Results 27 4.1 Text-impact mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 4.1.1 Standard attention map . . . . . . . . . . . . . . . . . . . . . 27 4.1.2 Average word attention . . . . . . . . . . . . . . . . . . . . . . 31 4.1.3 Word weight adjustment . . . . . . . . . . . . . . . . . . . . . 32 4.1.3.1 Discrete LMS Scheduler . . . . . . . . . . . . . . . . 32 4.1.3.2 DDIM Scheduler . . . . . . . . . . . . . . . . . . . . 34 4.1.3.3 PNDM Scheduler . . . . . . . . . . . . . . . . . . . . 35 4.2 Word-to-Image Mapping . . . . . . . . . . . . . . . . . . . . . . . . . 36 4.2.1 Perturbation Mapping . . . . . . . . . . . . . . . . . . . . . . 36 4.2.2 Gradient Mapping . . . . . . . . . . . . . . . . . . . . . . . . 39 4.3 Noise-impact mapping . . . . . . . . . . . . . . . . . . . . . . . . . . 42 4.4 Movement-to-Image pipeline Proof-of-Concept . . . . . . . . . . . . . 45 5 Discussion 49 5.1 Prompt consistency through the Encoder layers . . . . . . . . . . . . 49 5.2 Perturbation vs Gradient input-to-output mappings . . . . . . . . . . 49 5.3 Noise-impact mapping and Common sense . . . . . . . . . . . . . . . 50 6 Conclusion 51 A Appendix 1 I xii List of Figures 2.1 A simplified overview of the Stable Diffusion architecture. "Prompt" is the main input, and "Noise latents" are the secondary input, often represented as a seed. The "CLIP text encoder" is the model that converts a prompt into a vector, "Denoising U-Net" is the actual image generation model, and the "VAE decoder" is the model that turns the vector representation of an image into a pixel image . . . . . . . . . . 4 2.2 A overview of the Scaled Dot-Product Attention operation, the first "MatMul" represents the matrix multiplication QKT and "Scale" is the division be √ dk. The graph was displayed through Graphviz. . . 8 2.3 A overview of the Multi-Head Attention mechanism the "3D boxes" represents a number of "heads", "Concat" is the concatenate operation. This was displayed through Graphviz. . . . . . . . . . . . . . . . . . . 9 2.4 This image displays an example of an Attention Map for a trans- former, in this case, the indices i and j correspond to the words, or tokens: "a", "photograph", "of", "an", "astronaut", "riding", "a", "horse". This specific example is a sub-set of an Attention Map from the transformer model CLIP . . . . . . . . . . . . . . . . . . . . . . . 10 2.5 A overview of the Transformer architecture. "Norm" corresponds to layer normalization, and "Output Probabilities" is the output of the decoder part of the model in the context of a text generator; this is commonly an index to a token, or word. 
. . . . . . . . . . . . . . . . 11 2.6 A overview of the unique training process that results in two encoders used in the CLIP model. Displayed through Graphviz. . . . . . . . . 13 2.7 An overview of the training process that is core to Diffusion models. "Image" corresponds to the training data or the reconstructed image depending on if noise is added or removed, corresponding to train- ing or interference, and "N" is the number of iterations of adding or removing noise. Displayed through Graphviz. . . . . . . . . . . . . . 14 2.8 A overview of the U-Net architecture. Displayed through Graphviz. . 18 2.9 A overview of the Stable Diffusion model architecture. The "Latent Space" is the lower dimensional vector representation of images, "Pixel Space" is the pixel representation, and "Prompt Space" is the string representation of the prompt. There are two paths in "Latent Space"; one is for interference, and one is for training. The interference path starts from the seed-generated noise and the prompt, whereas the training path starts with an image from the training dataset . . . . . 20 xiii List of Figures 4.1 The full attention maps of four heads in different layers i.e. the cross attention for all tokens. . . . . . . . . . . . . . . . . . . . . . . . . . . 28 4.2 A zoomed and cropped attention map for one of the heads from the first layer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 4.3 The attention maps of four heads in different layers, the heads are the same as in 4.1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 4.4 The attention score of one of the heads from the first layer. . . . . . . 31 4.5 The average attention score of all heads and all layers. . . . . . . . . 32 4.6 Image generation results for "a portrait of a cyborg in a golden suit, concept art", adjusted for the word: "suit". (a) ϵ = −0.6. (b) ϵ = −0.4. (c) ϵ = −0.2. (d) ϵ = 0.0. (e) ϵ = 0.2. (f) ϵ = 0.4. (g) ϵ = 0.6 (Note that a light Stable Diffusion variant was used for computational efficiency, this does however reduce image quality). . . . . . . . . . . 33 4.7 Image generation results for "a photograph of an astronaut riding a horse", adjusted for the word: "photograph". (a) ϵ = −0.6. (b) ϵ = −0.4. (c) ϵ = −0.2. (d) ϵ = 0.0. (e) ϵ = 0.2. (f) ϵ = 0.4. (g) ϵ = 0.6. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 4.8 Image generation results for "a photograph of an astronaut riding a horse", adjusted for the word: "astronaut". (a) ϵ = −0.6. (b) ϵ = −0.4. (c) ϵ = −0.2. (d) ϵ = 0.0. (e) ϵ = 0.2. (f) ϵ = 0.4. (g) ϵ = 0.6. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 4.9 Image generation results for "a portrait of a cyborg in a golden suit, concept art", adjusted for the word: "suit" with different epsilon values using the DDIM scheduler. (a) ϵ = −0.6. (b) ϵ = −0.4. (c) ϵ = −0.2. (d) ϵ = 0.0. (e) ϵ = 0.2. (f) ϵ = 0.4. (g) ϵ = 0.6. . . . . . . . . . . . . 34 4.10 Image generation results for "a portrait of a cyborg in a golden suit, concept art", adjusted for the word: "suit" with different epsilon values using the PNDM scheduler. (a) ϵ = −0.6. (b) ϵ = −0.4. (c) ϵ = −0.2. (d) ϵ = 0.0. (e) ϵ = 0.2. (f) ϵ = 0.4. (g) ϵ = 0.6. . . . . . . . . . . . . 35 4.11 Visualization showing the effect of perturbations on the prompt "a photograph of an astronaut riding a horse", the perturbation size was +15% in this image. . . . . . . . . . . . . . . . . . . . . . . . . . . . 
36 4.12 Visualization constructed showing the effect of perturbations on the prompt "a photograph of an astronaut riding a horse", the perturba- tion size was +15% in this image, and the perturbation differences were directly overlaid. . . . . . . . . . . . . . . . . . . . . . . . . . . 37 4.13 Images showing the effect of +15% size perturbations for each word in the text prompt "a photograph of an astronaut riding a horse". . . 38 4.14 Visualization constructed in Python showing the "Saliency Map" of text prompts impact on the image output (approximated through the Finite difference method). . . . . . . . . . . . . . . . . . . . . . . . . 39 4.15 Visualization constructed showing the "Saliency Map" of text prompts impact on the image output, the differences are directly overlaid. . . . 40 4.16 The "gradients" in image format computed during the Finite differ- ence method for each word in the text prompt "a photograph of an astronaut riding a horse". . . . . . . . . . . . . . . . . . . . . . . . . . 41 xiv List of Figures 4.17 The moving average (the line) of the similarity score, as well as the standard deviation (with an infinite window), computed by the CLIP model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 4.18 The moving standard deviation of the similarity score, computed by the CLIP model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 4.19 Examples of prompt-dependent variation. (a)-(d) Images generated for the prompt "a photograph of an astronaut in space". (e)-(h) Im- ages generated for the prompt "a photograph of an astronaut riding an elephant". (i)-(l) Images generated for the prompt "a photograph of an astronaut riding a horse in the desert". (m)-(p) Images gener- ated for the prompt "a photograph of an astronaut riding a horse in the prism" (Note that a light Stable Diffusion variant was used for computational efficiency, this does however reduce image quality). . . 44 4.20 A image showing part of the movement-to-image pipeline proof-of- concept. The plot is the interface; in this case, a PCA plot on the embedding space, and the red dot is the input embedding. . . . . . . 45 4.21 A image showing part of the movement-to-image pipeline proof-of- concept. The image is the generated image corresponding to the interface 4.20 and the prompt "a picture of a space explorer galloping on a horse". . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 4.22 A image showing part of the movement-to-image pipeline proof-of- concept. The plot is the interface and the red dot is the adjusted input embedding. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 4.23 A image showing part of the movement-to-image pipeline proof-of- concept. The image is the generated image from adjusted embedding corresponding to the interface 4.22 (Note that a light Stable Diffu- sion variant was used for computational efficiency, this does however reduce image quality). . . . . . . . . . . . . . . . . . . . . . . . . . . 48 A.1 Full attention maps of all heads in the first layer i.e. the cross atten- tion for all tokens at every layer. . . . . . . . . . . . . . . . . . . . . . I A.2 Zoomed and cropped attention maps for all of the heads in the first layer, providing a comparison between heads. . . . . . . . . . . . . . . II A.3 The attention scores of all heads in the first layer. . . . . . . . . . . . III A.4 Other image generation results with different epsilon values using the DDIM scheduler. 
The top two rows show the embeddings adjusted for the word: "photograph" and the bottom two rows show the em- beddings adjusted for the word: "astronaut". For each set of images from left to right: (a) ϵ = −0.6, (b) ϵ = −0.4, (c) ϵ = −0.2, (d) ϵ = 0.0, (e) ϵ = 0.2, (f) ϵ = 0.4, (g) ϵ = 0.6. . . . . . . . . . . . . . . . IV A.5 Other image generation results with different epsilon values using the PNDM scheduler. The top two rows show the embeddings adjusted for the word: "photograph" and the bottom two rows show the em- beddings adjusted for the word: "astronaut". For each set of images from left to right: (a) ϵ = −0.6, (b) ϵ = −0.4, (c) ϵ = −0.2, (d) ϵ = 0.0, (e) ϵ = 0.2, (f) ϵ = 0.4, (g) ϵ = 0.6. . . . . . . . . . . . . . . . V xv List of Figures xvi List of Tables 2.1 Explanation of Query, Key, and Value . . . . . . . . . . . . . . . . . . 7 3.1 Key objects in the pipeline . . . . . . . . . . . . . . . . . . . . . . . . 22 3.2 Hyperparameter used during the "Adjustments for control" evaluation. 24 3.3 Settings used for the diffusion model when computing the image for prompt-to-image map. . . . . . . . . . . . . . . . . . . . . . . . . . . 24 xvii List of Tables xviii 1 Introduction 1.1 Background In recent years, the machine learning field has been revolutionized by the advent of very large models capable of undertaking a multitude of tasks. Notable examples include GPT-4, adept at text generation, and DALL-E 2, which excels in image synthesis. These models demonstrate an intriguing blend of scale and generality, setting new standards in areas such as natural language processing and computer vision. However, their widespread use in practical applications where initially limited. While potentially powerful, the outputs of these models often require specialized knowledge for control and effective use. While models like GPT-4 exhibit impressive capabilities, their direct application can be challenging, particularly in tasks like chatbot creation. For instance, unless properly prompted, these models tend to generate the bot’s response and the antici- pated user’s next question. This highlights the complexity of controlling the model’s output and has prompted the development of more refined approaches. An example of such is ChatGPT. This tool, built on GPT-4 or GPT-3, depending on the version, implements a response-to-answer structure that limits the model’s responses, thus simplifying its use as a chatbot. This problem, however, seems to persist in image generation models where the output of the models does not seem to be much more useful in commercial production compared to what an internet search might produce, as most commercial products appear to still employ largely the same methods of asset generation as before. This seemingly implies that the image generation models are at most used as a step in this generation, likely concept image generation, which internet searches are already relatively good at. 1.2 Aim This project aims to explore possible paths for increasing control of the output from large image generation models based on Diffusion through input-output mappings, i.e., Determining which parts of input have which effects on the output. The focus is on the Stable Diffusion models created by Stability AI. 1 1. Introduction 1.3 Problem description 1.3.1 Limitations In the interest of time, this project has been limited to mostly using pre-trained models to decrease the need for hyperparameter tuning. 
This project has also not explored fine-tuning of the methods evaluated unless strictly necessary for the eval- uation, i.e., the project has focused on proof-of-concept. 1.3.2 Specification of issues under investigation This master thesis aims to handle the following questions: 1. Can input-output mappings be constructed for the modules, i.e., The sub- models of a large ML-model, specifically the Stable Diffusion model? 2. Can input-output mappings be leveraged to gain insights into possible control opportunities? 3. Can these insights be extended to actual proof-of-concept control methods? 1.4 Ethical Considerations Using diffusion models in image generation introduces ethical issues: Potential risk of misuse and impact on artistic professionals. 1.4.1 Misuse Risk High-fidelity images generated through diffusion models hold some potential for harmful content production, for example, misinformation or violent content. Avoid- ing this might require robust content moderation or, more likely, algorithmic filtering systems. 1.4.2 Impact on Artistic Professionals Diffusion models can both aid and challenge artists. They could enhance creativity but potentially undermine the work of artists specializing in traditional tools. This needs to be accounted for while not overlooking the potential creativity boost for all individuals who stand to benefit from the employment of more powerful tools, like AI. 2 2 Theory The basic concept behind neural networks is a model consisting of a large number of tunable parameters, employing matrices to create a structure reminiscent of biolog- ical neural networks. This, combined with optimization methods, creates a model capable of generalized learning [1]. Some extensions on this base model on the level of the artificial neuron exist, for ex- ample, Convolutional Neural Networks (CNNs) [2]. However, many models simply combine these basic layers and the extended layers, in novel and innovative archi- tectures, training methods, and interference procedures, for example, Generative Adversarial Networks (GANs), which use a distinctive adversarial training process [3]. Many of the models that have revolutionized the machine learning field in recent years represent an extra level of abstraction. These model’s architectures employ multiple models, which themselves are based on many types of layers. For example, the Stable Diffusion model includes text embedding networks based on transformers as well as Autoencoders [4]. The following sections provide a description of each of the parts that make up the Stable Diffusion model, which is heavily utilized throughout this project, with the only exception consisting of the CLIP image encoder (which is not directly a part of the Stable Diffusion model). 2.1 Overview and Context This section walks through the Stable Diffusion architecture at a high level. This can be viewed as the context for the models treated in this chapter. 2.1.1 Stable Diffusion architectural overview The core of Stable Diffusion is the denoising model, i.e., the (diffusion) model that turns noise into an image. In the case of Stable Diffusion, this is referred to as the "Denoising U-Net". Stable Diffusion does not operate on images directly but instead on a lower-dimensional representation of an image. The model that turns this representation into an image is the "VAE decoder". Lastly, the denoising model is prompt-directed, i.e., the model uses a vector representation of a prompt to direct the image generation. 
The "CLIP text encoder" (a transformer model) produces this prompt vector representation. 3 2. Theory Figure 2.1: A simplified overview of the Stable Diffusion architecture. "Prompt" is the main input, and "Noise latents" are the secondary input, often represented as a seed. The "CLIP text encoder" is the model that converts a prompt into a vector, "Denoising U-Net" is the actual image generation model, and the "VAE decoder" is the model that turns the vector representation of an image into a pixel image 2.2 Theoretical framework The components in the Stable Diffusion model can be viewed from the base theo- retical framework provided by Autoencoders, i.e., models or processes representing encoders and decoders. 2.2.1 Autoencoders Autoencoders are a type of neural network designed for data compression and noise reduction. Their architecture consists of two key components: an encoder and a decoder. The encoder function, denoted as f , takes an input x, converting it into a compressed latent representation, h, i.e., h = f(x). This latent space encapsulates the essential features of the input while shrinking the dimensionality. The decoder function, g, reconstructs the input data from the latent space. It works by mapping h back into the original high-dimensional input space, denoted as x′. Thus, the reconstructed input is given by x′ = g(h). The goal during training is to minimize the difference between the input data x and the reconstructed data x′, i.e., x ≈ x′. This is achieved using a loss function, typically the Mean Squared Error (MSE), which measures the average squared differences between x and x′. This process results in a model that can compress x into h (using the encoder) and then reconstruct an approximation of x, i.e., x′ (using the decoder) [5]. 2.3 Converting prompts into Text embeddings The Stable Diffusion model employs text embeddings to direct the Diffusion process, i.e., allowing prompt-directed image generation. 2.3.1 Tokenization Tokenization is the first step in the processing pipeline of Natural Language Process- ing (NLP). It entails dissecting a text string into a sequence of individual fragments 4 2. Theory represented as an integer, referred to as "tokens". The granularity of these tokens can vary, spanning from single characters to whole words or even larger chunks of text, depending on the specifics of the tokenization approach. The goal of the tokenizer is to represent any text with as few distinct tokens as possible in order to minimize the dimensionality of the vocabulary that is used in the next step of the NLP pipeline, i.e., the Word Embeddings, while at the same time capturing the semantic structure inherent in words and phrases. There are two ’naive’ approaches to tokenization: character-level and word-level. Character-level tokenization treats every character as a token, effectively minimizing the vocabulary size, which is beneficial for computational efficiency. However, it falls short in capturing semantic structure, i.e., part of the information contained in the text [6]. Word-level tokenization, on the other hand, treats every word as a token, preserv- ing semantic structure and information. This method, however, results in a large vocabulary. For example, in languages where words can have many forms, such as English, word-level tokenization ends up treating "run", "runs", "running", and "ran" all as different tokens, thus increasing the size of the vocabulary. 
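To make the contrast between the two naive strategies concrete, the short Python sketch below (illustrative only, not taken from the thesis code) builds both vocabularies for a toy corpus; the example sentences are arbitrary.

```python
# Illustrative toy corpus (arbitrary example sentences, not thesis data).
corpus = ["an astronaut riding a horse", "an astronaut rides horses"]

# Character-level tokenization: tiny vocabulary, but single characters carry
# almost no semantic information on their own.
char_vocab = sorted({ch for text in corpus for ch in text})

# Word-level tokenization: every surface form ("riding", "rides", "horses")
# becomes its own token, so the vocabulary grows quickly.
word_vocab = sorted({word for text in corpus for word in text.split()})

print(len(char_vocab), char_vocab)
print(len(word_vocab), word_vocab)
```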
A more practical middle-of-the-road approach is subword-level tokenization, i.e., breaking down words into frequently occurring subwords. This approach balances minimizing the vocabulary and capturing the semantic structure. A common method for achieving this is Byte Pair Encoding (BPE), a technique used in the CLIP model. BPE begins with a vocabulary of individual characters. It then iteratively merges the most frequent adjacent character pairs in the dataset to form new symbols, thereby extending the vocabulary. This process continues until a specified vocabulary limit (a hyperparameter) is reached. BPE's strength lies in its ability to handle out-of-vocabulary words by breaking them down into recognizable subword units, efficiently capturing semantic information while managing vocabulary size [7].

2.3.2 Word Embedding

After tokenization, the next critical step in the NLP pipeline is word embedding, a technique that converts discrete tokens into a continuous vector representation. This process involves mapping each token, represented by a unique integer ID from the tokenization step (see 2.3.1 above), into a high-dimensional vector space [8].

Word embeddings are created using an embedding layer, which operates as a lookup table that links unique integer IDs to dense vectors. These vectors are initially filled with random values. During training, these values are fine-tuned to reduce a cost function that measures how well the model predicts contextual words for a given target word. The better the model's predictions, the more effectively the embeddings capture the relationships between words. As a result, semantically similar words end up with similar embeddings, reflecting their contextual similarities [9].

2.3.3 Positional Encoding

The next step, after obtaining the Word Embeddings, is to add the Positional Encoding (PE). Positional Encoding is a way to encode the information about where a word is in a text. This is crucial, as a word's position significantly influences the semantic information in a text.

The Positional Encoding scheme addresses a fundamental limitation in transformer models: their inability to inherently capture the order of words due to the self-attention mechanism (2.3.4). It does so by creating a set of vectors representing each word's position in the sequence according to Equation (2.1). A frequent method of Positional Encoding is to use sinusoidal functions, which are commonly defined as:

    PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)    (2.1)

Here pos is the position of the word, or token, in the sequence of words that is the text, and i is the dimension in the embedding vector. That is, each dimension of the positional encoding corresponds to a sinusoid with a different frequency, and d_{model} is the dimension of the model [10].

Positional Encoding is added directly to the word embeddings, resulting in encoded vectors that carry both the semantic meaning of the words and their positional information. This amalgamated information becomes the input for the subsequent layers of the model. This technique allows the model to capture a unique encoding for each position and a relative representation of the distance between different positions in the sequence. The model, hence, becomes capable of recognizing patterns based on the order of words, a critical requirement for understanding and generating human language [10].
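As a concrete reading of Equation (2.1), the NumPy sketch below (not part of the thesis code) constructs the sinusoidal positional-encoding matrix for a toy sequence; the sequence length and model dimension are arbitrary illustrative values.

```python
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encoding following Equation (2.1)."""
    pos = np.arange(seq_len)[:, None]        # token positions, shape (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]     # dimension index, shape (1, d_model/2)
    angle = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)              # even embedding dimensions
    pe[:, 1::2] = np.cos(angle)              # odd embedding dimensions
    return pe

# The encoding is simply added element-wise to the word embeddings:
#     encoded = word_embeddings + positional_encoding(seq_len, d_model)
pe = positional_encoding(seq_len=16, d_model=64)
print(pe.shape)  # (16, 64)
```

Note that the CLIP text encoder used later in this work learns its positional embeddings during training rather than using fixed sinusoids; the sketch only illustrates the general scheme described above.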
2.3.4 Attention

The main part of the Transformer architecture is the Attention mechanism. Attention, in the context of neural networks, refers to the model's ability to focus on specific parts of the input when generating the output. It is inspired by human attention, where we focus more on certain aspects of our environment while paying less attention to others. In transformers, attention is commonly computed using the Scaled Dot-Product Attention mechanism. This process involves three main components: Query (Q), Key (K), and Value (V), each of which corresponds to different aspects of the input data. These are typically derived from the input sequence through a linear transformation, a trainable part of the model [10].

The idea of the Query, Key, and Value can be understood from the context of search:

Term        Analogy in Search    Transformer's Meaning
Query (Q)   Search Query         Current context used to select relevant input
Key (K)     Search Keywords      Elements in the input sequence used for matching
Value (V)   Search Results       Information pieces used to build the output

Table 2.1: Explanation of Query, Key, and Value

The Query (Q) and Key (K) components are used to compute the attention scores. This involves taking the dot product of Q and K, scaling it by the square root of their dimension size, and applying a Softmax function. This results in a distribution that indicates the amount of "attention" each word in the input should receive when generating a new word in the output sequence. The Value (V) component corresponds to the actual content associated with each word. The computed attention distribution is then multiplied by V, essentially deciding the information to pass onto the next layer based on the calculated attention. To summarize, the attention mechanism can be represented as:

    \text{Attention}(Q, K, V) = \text{Softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V    (2.2)

In this equation, Q, K, and V are matrices representing the Query, Key, and Value vectors for all words in the input sequence. The trainable parameters in this attention mechanism live within the linear transformations that derive Q, K, and V from the input sequence [10].

Figure 2.2: An overview of the Scaled Dot-Product Attention operation; the first "MatMul" represents the matrix multiplication QK^T, and "Scale" is the division by \sqrt{d_k}. The graph was displayed through Graphviz.

2.3.5 Multi-Head Attention

The transformer extends the concept of attention into what is known as Multi-Head Attention. Operating at the same level in the network as traditional attention, this enhanced mechanism allows the model to generate multiple "interpretations" of the input, focusing on different words for each interpretation.

In Multi-Head Attention, the inputs (query, key, and value) are first linearly transformed into multiple sets. Each of these sets is then fed into its own Scaled Dot-Product Attention mechanism, creating multiple "heads". The outputs of all the attention heads are then concatenated and linearly transformed to result in the final output [10].

This approach enhances the capacity of the model to focus on different parts of the input simultaneously, thereby capturing a more comprehensive understanding of the input data. The more attention heads, the greater the number of interpretations, leading to a more nuanced understanding of the sequence.

Figure 2.3: An overview of the Multi-Head Attention mechanism; the "3D boxes" represent a number of "heads", and "Concat" is the concatenate operation. This was displayed through Graphviz.
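The following PyTorch sketch (illustrative, not the thesis implementation) spells out Equation (2.2) and the head-splitting idea behind Multi-Head Attention. The random projection matrices stand in for the trained linear layers, and the final output projection is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Equation (2.2): Softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # (..., seq_q, seq_k)
    weights = F.softmax(scores, dim=-1)             # the attention map
    return weights @ V, weights

# Toy multi-head attention: split d_model into `heads` subspaces,
# attend in each head, then concatenate the results.
torch.manual_seed(0)
seq_len, d_model, heads = 8, 64, 4
x = torch.randn(seq_len, d_model)                   # one embedded sequence

W_q, W_k, W_v = (torch.randn(d_model, d_model) for _ in range(3))  # stand-ins for learned layers
Q, K, V = x @ W_q, x @ W_k, x @ W_v

d_head = d_model // heads
split = lambda M: M.view(seq_len, heads, d_head).transpose(0, 1)   # (heads, seq, d_head)
out, attn = scaled_dot_product_attention(split(Q), split(K), split(V))
out = out.transpose(0, 1).reshape(seq_len, d_model)                # concatenate the heads
print(out.shape, attn.shape)  # torch.Size([8, 64]) torch.Size([4, 8, 8])
```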
2.3.6 Attention Maps Attention maps portray the attention weights between tokens in a specific input sequence, offering a detailed look into the internal decision-making process within a transformer model. These maps are presented as square matrices, Attention Mapi,j, with indices i and j denoting the positions of query and key tokens, respectively, for a particular input sequence [10]. 9 2. Theory Figure 2.4: This image displays an example of an Attention Map for a transformer, in this case, the indices i and j correspond to the words, or tokens: "a", "photograph", "of", "an", "astronaut", "riding", "a", "horse". This specific example is a sub-set of an Attention Map from the transformer model CLIP However, interpreting attention maps should be approached with care. High at- tention scores may not necessarily equate to a significant influence on the model’s final output, as the actual impact of these attention scores depends on the intricate network dynamics [11]. 2.3.7 Transformer The transformer, a significant architecture first introduced in the paper "Attention is All You Need" [10], ingeniously assembles the previously discussed components - tokenization, word embeddings, positional encoding, and attention mechanisms - into a model for sequence processing. The transformer architecture consists of an encoder and a decoder, each composed of multiple identical layers stacked on top of each other. Each of these layers in the encoder contains two main sub-layers: a Multi-Head Attention mechanism and a fully connected feed-forward network. Around each of these sub-layers, a residual connection is employed, followed by layer normalization i.e., shifts the activations within the layer to have a mean of zero and a standard deviation of one [10]. Initially, the input sequence is tokenized (see 2.3.1), and each token is embedded into a high-dimensional vector using an embedding layer (see 2.3.2). Positional encoding is then added to these word embeddings to incorporate the order of the words (see 2.3.3). These encoded vectors are passed into the encoder, the gray box on the left in figure 10 2. Theory 2.5, which uses the Multi-Head Attention mechanism to establish a weighted repre- sentation of the input sequence, considering the interaction of each word with every other word. The feed-forward network processes these weighted representations in- dependently to produce the encoder output. Figure 2.5: A overview of the Transformer architecture. "Norm" corresponds to layer normalization, and "Output Probabilities" is the output of the decoder part of the model in the context of a text generator; this is commonly an index to a token, or word. 11 2. Theory The decoder, the gray box on the right in figure 2.5, in a transformer model, in- cludes an additional sub-layer compared to the encoder. This sub-layer uses at- tention, specifically multi-head attention, on the encoder’s output, enabling tasks that require correlation between sequences, such as translation. The final decoder layer’s output is then converted into output probabilities for each token in the tar- get vocabulary. Unlike RNNs and LSTMs, Transformers are particularly effective at capturing long-range dependencies in sequence data due to their comprehensive uti- lization of attention mechanisms. This makes them a robust and powerful approach for sequence-to-sequence tasks [10]. 
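In practice, the attention maps of a pre-trained transformer can be read out directly. The sketch below shows one possible way (not the thesis' own pipeline) using the Hugging Face transformers library and the CLIP version listed later in Table 3.2.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

name = "openai/clip-vit-base-patch32"          # CLIP version from Table 3.2
tokenizer = CLIPTokenizer.from_pretrained(name)
text_model = CLIPTextModel.from_pretrained(name)

prompt = "a photograph of an astronaut riding a horse"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = text_model(**inputs, output_attentions=True)

# outputs.attentions is a tuple with one tensor per encoder layer,
# each of shape (batch, heads, seq_len, seq_len): the attention maps.
first_layer = outputs.attentions[0][0]          # (heads, seq_len, seq_len)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print(len(outputs.attentions), first_layer.shape)
print(tokens)  # includes the start/end tokens around the subword tokens
```

Indexing one layer and one head of this tuple gives a single square matrix of the kind shown in Figure 2.4.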
2.3.8 Vision Transformer

Much like how traditional transformers handle sequential data, such as text, the Vision Transformer (ViT) deals with visual data and images. Its distinctive feature is treating an image as a sequence of patches, much like how words form a sequence in a sentence. This approach helps to identify and utilize the inherent structure in the visual data, making the Vision Transformer a highly effective model for image classification and related tasks.

An image, in the context of a ViT, is divided into a grid of non-overlapping patches analogous to tokens in text processing. These patches are then linearly embedded to form a sequence of vectors (analogous to how word embeddings form a prompt embedding). Following the practice of text processing, a specific "start token" is commonly appended to the beginning of this sequence. This token provides context and some positional information for the whole image, similar to how it is used in text sequences [12].

Positional embeddings, analogous to the ones in the transformer model for NLP tasks, are also added to each patch embedding to encode their relative positions in the image grid. The combination of patch embeddings, positional embeddings, and the "start token" creates the final sequence of vectors that forms the input to the transformer.

The transformer layers in the Vision Transformer model are the same as those in the classic transformer: multi-head self-attention and feed-forward neural networks, coupled with layer normalization and residual connections. The model, through these layers, learns to relate patches not just with their immediate neighbors but with distant ones as well, ultimately capturing the global context of an image [12].

The ViT serves as a compelling exemplification of how transformer architectures can effectively be applied to computer vision. By reframing image analysis in terms of sequential data processing, the ViT ushers in a new avenue of image interpretation that prioritizes global context and long-range interdependencies. This move away from the localized focus of traditional CNNs enhances the model's capability to discern complex, interwoven visual narratives within an image [12] [13].

2.3.9 CLIP

The CLIP model is a widely used model for connecting text and images through embeddings. The concept is similar to Autoencoders, with the fundamental difference being that CLIP is not trained to reproduce a text prompt from an image, or vice versa. Instead, the objective function consists of minimizing the cosine similarity between non-matching text-image pairs while maximizing the cosine similarity between matching text-image pairs; thus, the model employs two encoders. The models used to achieve this are one Transformer for the text encoder and a Vision Transformer for the image encoder [14].

Figure 2.6: An overview of the unique training process that results in two encoders used in the CLIP model. Displayed through Graphviz.

The cosine similarity between a text embedding, t, and an image embedding, i, can be computed as:

    \text{Cosine Similarity}(t, i) = \frac{t \cdot i}{\lVert t \rVert_2 \times \lVert i \rVert_2}    (2.3)

where t \cdot i is the dot product of the text and image embeddings, and \lVert t \rVert_2 and \lVert i \rVert_2 are their respective L2 norms.
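To make Equation (2.3) concrete, the sketch below (illustrative, not thesis code) embeds a text and an image with the two CLIP encoders via the Hugging Face transformers library and computes their cosine similarity; the random image is just a placeholder so that the example is self-contained.

```python
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

name = "openai/clip-vit-base-patch32"   # CLIP version from Table 3.2
model = CLIPModel.from_pretrained(name)
processor = CLIPProcessor.from_pretrained(name)

# Any PIL image works here; a random one keeps the sketch self-contained.
image = Image.fromarray(np.random.randint(0, 255, (224, 224, 3), dtype=np.uint8))
text = "a photograph of an astronaut riding a horse"

inputs = processor(text=[text], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    t = model.get_text_features(input_ids=inputs["input_ids"],
                                attention_mask=inputs["attention_mask"])
    i = model.get_image_features(pixel_values=inputs["pixel_values"])

# Equation (2.3): cosine similarity between the text and image embeddings.
similarity = torch.nn.functional.cosine_similarity(t, i)
print(similarity.item())
```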
2.4 Diffusion Models

The core of the Stable Diffusion model is the Diffusion module. This is the module that causes the apparent "creativity" displayed by the model.

2.4.1 Base Diffusion Model

Diffusion models can be understood as stochastic iterative versions of Autoencoders. The "encoder" in a diffusion model is represented by a stochastic differential equation (SDE) which progressively corrupts the data with noise, driving it towards a simpler distribution (usually a standard multivariate Gaussian). This can be thought of as a noisy generalization of an Autoencoder's encoder mapping the data into a latent space. The "encoding" process in a diffusion model is mathematically represented by an SDE that resembles Brownian motion:

    dx_t = \sqrt{2\alpha}\, dW_t,    (2.4)

where x_t represents the state of the data at time t, α signifies the timestep, and W_t is a standard Wiener process. At each timestep, Gaussian noise is added to the data, gradually driving it towards a simpler distribution, typically a standard multivariate Gaussian [15].

The "decoder" in a diffusion model corresponds to the reverse process that starts from samples drawn from the simpler distribution and progressively removes the added noise, reconstructing the original data. This mirrors an Autoencoder's decoder, which attempts to map from the latent space back to the original data space [15].

Figure 2.7: An overview of the training process that is core to Diffusion models. "Image" corresponds to the training data or the reconstructed image, depending on whether noise is added or removed, corresponding to training or inference, and "N" is the number of iterations of adding or removing noise. Displayed through Graphviz.

However, unlike a standard Autoencoder, the reverse or "decoding" process in a diffusion model is not deterministic. It involves learning a Markov transition kernel that captures the reverse of the noise addition process described above. In mathematical terms, this involves learning a reverse-time SDE of the form

    dx_t = -\sqrt{2\alpha}\, dW_t + \beta\, dt,

where β is a function that is learned from the data. This learned function serves as a "denoising" operation that removes noise from the data. Training a diffusion model involves optimizing this function to minimize the difference between the original and reconstructed data, similar to minimizing the reconstruction error in an Autoencoder.

2.4.2 Schedulers

In the context of diffusion models, schedulers play a crucial role in determining how the noise is added and removed over the course of the diffusion process. Essentially, they govern the evolution of the noise variance over time, which directly influences the behaviour of the stochastic differential equations (SDEs).

A noise schedule is a sequence of noise levels (α_t) at each timestep. During the forward or "encoding" process, the scheduler determines how much noise is introduced at each timestep. This is usually done in such a way that the variance of the noise increases over time, moving the data closer to the target distribution (often a standard multivariate Gaussian) [16].

During the reverse or "decoding" process, the scheduler dictates the "denoising" process. It governs how much of the noise is subtracted at each timestep, guiding the data's trajectory back to the original distribution. In this case, the noise levels typically decrease over time.

Schedulers are usually chosen based on empirical performance. There is a wide range of possible schedules, from linear ones where the variance increases or decreases linearly with time to more complex non-linear schedules. The optimal schedule depends on the specific dataset and task at hand, and discovering good noise schedules is an active area of research in diffusion models.
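A minimal sketch of the forward (noise-adding) side of this process is given below, using one common discrete parameterization (the DDPM-style closed-form forward process). The linear schedule values are illustrative assumptions and do not match the exact schedule used by Stable Diffusion.

```python
import torch

# A simple linear variance schedule; the schedule used in practice differs,
# so these numbers are purely illustrative.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def add_noise(x0: torch.Tensor, t: int) -> torch.Tensor:
    """Sample x_t from the forward process in closed form:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise."""
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t]
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

x0 = torch.randn(1, 4, 64, 64)   # stands in for an image (or latent) to be corrupted
xt_early, xt_late = add_noise(x0, t=10), add_noise(x0, t=999)

# Early timesteps stay strongly correlated with x_0; late ones are close to pure noise.
corr = lambda a, b: torch.corrcoef(torch.stack([a.flatten(), b.flatten()]))[0, 1]
print(corr(x0, xt_early).item(), corr(x0, xt_late).item())
```

The reverse of this process is what the trained denoising network and the scheduler carry out step by step, and it is in that reverse step that the scheduler variants described next differ.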
2.4.2.1 Denoising Diffusion Implicit Models The widely employed Denoising Diffusion Probabilistic Models (DDPMs) operate as standard diffusion processes in the domain of diffusion models, following a noise schedule set by the scheduler. However, Denoising Diffusion Implicit Models (DDIMs) offer a key alternative that extends the diffusion process beyond the standard Marko- vian (depending only on the preceding state) framework. DDIMs address the computational inefficiencies of DDPMs by employing non-Markovian (depending on more than the preceding i.e. the path matters) diffusion processes. Both models share similar training procedures but diverge in their generative pro- cesses and diffusion mechanisms [17]. In contrast to DDPMs, where the generative process is the reverse of a Markovian diffusion process, DDIMs facilitate a deterministic generative process, significantly improving sample speed. The extended diffusion process in DDIMs involves designing reverse generative Markov chains. This results in a training objective identical to DDPMs, but it per- mits a wider array of generative models. These non-Markovian diffusion processes generate "short" Markov chains, improving sample efficiency with only a minor com- promise in sample quality [17]. 15 2. Theory Finally, DDIMs enhance sample generation quality compared to DDPMs, particu- larly under accelerated sampling conditions. DDIMs also incorporate a "consistency" property, enabling semantically meaningful image interpolation capabilities [17]. 2.4.2.2 Linear Multistep Schedulers Scheduling of noise levels is pivotal in controlling diffusion processes, especially in Denoising Diffusion Probabilistic Models (DDPMs). To enhance this, Linear Multistep Schedulers (LMS) have been designed, offering a nuanced approach to noise management. In stark contrast to the conventional linear or straightforward schedules, LMS em- ploys a more flexible approach, enabling the tailoring of the β or noise values used at each step of the diffusion process. This adaptability ensures the potential to optimize noise schedules according to the specific requirements of a given task or dataset [18]. The critical parameters under LMS control include the starting and ending noise values, the total number of training timesteps, and the type of prediction. The ability to manipulate these parameters allows for an informed and strategic approach to noise management in DDPMs, contributing to enhanced model performance. In essence, the LMS offers a more dynamic and customizable solution for noise scheduling within DDPMs, serving as an instrumental tool for potential improve- ments in model efficiency and sample generation quality [18]. 2.4.2.3 Pseudo Numerical Methods Pseudo Numerical Methods for Diffusion Models (PNDMs) is a novel approach to further enhance the effectiveness and efficiency of generative diffusion models. This methodology operates under a unique perspective, viewing Denoising Diffusion Prob- abilistic Models (DDPMs) as a process of solving differential equations on manifolds i.e., solving the differential equation locally. Accelerating DDPMs traditionally involves adjusting the variance schedule or mod- ifying the denoising equation. However, these changes often lead to a compromise in sample quality and could potentially introduce new noise at high speedup rates, thereby limiting their practical utility. To address this, PNDMs pioneer the in- troduction of pseudo-numerical methods specifically designed for diffusion models [19]. 
Pseudo-numerical methods represent a variant of classical numerical methods adapted to operate efficiently on manifolds. By incorporating these methods, PNDMs can markedly speed up the inference process while maintaining high sample quality. PNDMs have indeed demonstrated their potential by producing superior-quality synthetic images in significantly fewer steps compared to Denoising Diffusion Im- plicit Models (DDIMs). The application of PNDMs is versatile and accommodates a variety of DDIM-like models. These include DDIM, Score-Based Generative Modeling through Stochas- tic Differential Equations (SDE), and Improved Denoising Diffusion Probabilistic Models (DDPM). This adaptable framework provides an array of choices for the nu- merical method, the noise addition schedule, and the neural network used for fitting 16 2. Theory noise, facilitating adaptability to diverse tasks and datasets [19]. In conclusion, PNDMs represent a promising direction in the domain of diffusion models. They offer an efficient and effective approach to sample generation while maintaining a high level of quality. Their flexible framework and seamless integration with other libraries make them a versatile tool in the field of generative modeling. 2.4.3 GLIDE An important component in the diffusion modules in the Stable Diffusion model is text-conditional image generation i.e. directing the image denoising with text. This was introduced in the GLIDE model [20], and is simply done by giving the diffusion decoder a vector representing the caption (text embedding) of the image during the training. After training this produces a decoder that will produce an image from noise consistent with the text embedding it is given alongside. The Stable Diffusion model uses the CLIP encoder to produce this text embedding from the prompt. 2.5 Efficient image representations The Stable Diffusion employs a latent image space enabled by a Variational Au- toencoder. This is the main difference between the Stable Diffusion model and, for example, DALL-E-2 and the reason this model can run on relatively light hardware. 2.5.1 VAE Variational Autoencoders (VAEs) represent an intriguing twist to the conventional autoencoder structure. While traditional autoencoders encode input data into a deterministic latent representation, Variational Autoencoders, in contrast, model the input data as a probability distribution within the latent space. This fundamental shift enables VAEs to generate new data instances that closely resemble the input data. In VAEs, the encoder doesn’t directly produce a single latent vector. Instead, it outputs parameters of a Gaussian distribution: a mean vector (µ) and a standard deviation vector (σ). The reparameterization trick is then employed, using an aux- iliary noise variable ϵ sampled from a standard normal distribution to derive the latent vector: h = µ + σ ∗ ϵ [21]. One significant advantage of VAEs over traditional autoencoders is their ability to model complex and multi-modal data distributions. By representing the latent space probabilistically, VAEs can handle a broader range of data distributions, which often results in better representations of the input data. Moreover, VAEs can generate new instances of data that bear a strong resemblance to the input data, thus serving as a potent tool for generative tasks. Despite these advantages, VAEs do have their limitations. The generated data from VAEs often exhibit a certain "blurriness". 
This stems from the decoder’s expectation operation, which averages over the sampled latent variables, leading to less sharp reconstructions. Additionally, the computational cost of training VAEs is higher 17 2. Theory than that of traditional Autoencoders due to the need for Monte Carlo sampling to estimate the gradients during backpropagation. 2.5.2 Latent Diffusion Models The main thing that separates Stable Diffusion from other large and effective diffu- sion models like DALL-E-2 is that the diffusion is done in a highly compressed space. This is achieved by training the diffusion model with a vector representation of an image (latent vector) instead of an actual full or close to the full-resolution image; this is the concept behind Latent Diffusion Models (LDMs). This compressed latent space is commonly achieved by training a Autoencoder on a data set of images [22]. 2.6 U-Net The U-Net is a specific type of convolutional neural network (CNN) that follows the architectural principles of autoencoders, designed with a symmetric "U" shaped structure. U-Nets adopt the encoding-decoding philosophy of autoencoders, which has led to their successful application in various image processing tasks, notably in biomedical image segmentation [23]. In U-Nets, the encoding (contracting) pathway is responsible for capturing the context in the image, with a series of convolutional and max-pooling layers that progressively reduce the spatial dimensions of the input. The decoding (expanding) pathway, consisting of up-convolutional layers, serves to increase the spatial dimensions while reconstructing the original image’s details. A distinguishing feature of U-Nets is the presence of skip connections between the encoding and decoding pathways, reminiscent of autoencoders’ bottleneck design. These lateral connections pass detailed spatial information directly from the encod- ing path to the decoding path, allowing high-level and low-level features to be fused. This mechanism enables U-Nets to better localize and delineate features of inter- est, making the architecture highly effective for tasks that require precise spatial information. Figure 2.8: A overview of the U-Net architecture. Displayed through Graphviz. 18 2. Theory In the context of diffusion models, U-Nets offer several compelling advantages. Their ability to handle image noise, retain spatial information, and produce highly detailed and coherent outputs aligns with the requirements of denoising functions within diffusion models. As such, U-Nets have found application in this domain, further illustrating the versatility and robustness of their autoencoder-inspired design (The U-Net for Diffusion models was introduced in [24]). 2.7 The Stable Diffusion Model Stable Diffusion is one of a few large diffusion models; it is capable of generating high-quality images. Unlike other large diffusion models, for example, DALL-E-2 Stable Diffusion is open-source and can run on relatively light hardware. Stable Diffusion is an amalgamation of the concepts covered in the preceding sections, including the CLIP text encoder (see 2.3.9), Text-conditional image generation (see 2.4.3), VAEs (see 2.5.1), and LDMs (see 2.5.2). The model initiates by employing the CLIP text encoder (a transformer, see 2.3.7) to obtain the text embedding, which serves as a vector representation encapsulating the semantic information embedded in the text prompt [4]. Following the text embedding, the model’s core component, the diffusion model. 
This diffusion model has a U-Net architecture but includes connections that add the text embeddings at every encoding and decoding stage (see 2.6). However, rather than operating directly on pixel-level image data, the U-Net operates on a latent image representation, i.e., an LDM. This latent image representation is produced through a VAE, which, during inference, decodes the latent image representation into the pixel space. During training, the VAE compresses the high-dimensional images of the training dataset into a lower-dimensional latent space, enabling the model to work with more abstract and dense representations of the input data. This results in the relatively low computational requirements of the model [4].

Likely because the model is open-source, a large number of features and applications have been developed for it, for example, many schedulers and pipelines, i.e., complete model flows such as image-to-image [25] [26].

Figure 2.9: An overview of the Stable Diffusion model architecture. The "Latent Space" is the lower dimensional vector representation of images, "Pixel Space" is the pixel representation, and "Prompt Space" is the string representation of the prompt. There are two paths in "Latent Space"; one is for inference, and one is for training. The inference path starts from the seed-generated noise and the prompt, whereas the training path starts with an image from the training dataset.

3 Methods

This chapter motivates and describes the implementations of the input-output mappings in this work. A proposed method of control is also presented in this chapter.

3.1 Software environment

This project utilized Python, an industry-standard language in machine learning (ML) and data science, primarily due to its vast array of robust packages and its simple, versatile syntax.

The Stable Diffusion model, an open-source deep learning model, was leveraged in this project. Opting for this model was driven by several factors. First, its open-source nature grants complete access to all sub-components of the model, promoting a deeper understanding of its workings and enabling customization for greater flexibility. Second, as a large pre-trained model, Stable Diffusion is capable of generating realistic images. This allowed the project to focus on exploring and refining the input-output mappings, rather than investing time and resources in training a new model from scratch.

PyTorch was selected as the primary ML library primarily for its seamless integration with the Stable Diffusion model via the Hugging Face platform. Although PyTorch offers a flexible and efficient environment for implementing various ML models, its specific advantages in this project revolve around its compatibility with the chosen model and platform.

Lastly, the Hugging Face platform was employed as it offers convenient access specifically to the Stable Diffusion model. The use of Hugging Face simplified the process of leveraging the model in this project and provided a comprehensive set of tools for fine-tuning and experimentation.

The chosen software environment, in its entirety, was designed to promote a transparent, efficient, and effective execution of the project tasks.
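As a sketch of how the individual sub-models can be obtained in this software environment, the snippet below loads them separately with the diffusers and transformers libraries from the Stable Diffusion version listed in Table 3.2. It assumes the pre-trained weights are accessible via Hugging Face and is not the thesis' own pipeline code.

```python
from diffusers import AutoencoderKL, PNDMScheduler, UNet2DConditionModel
from transformers import CLIPTextModel, CLIPTokenizer

repo = "CompVis/stable-diffusion-v1-4"   # Stable Diffusion version from Table 3.2

tokenizer = CLIPTokenizer.from_pretrained(repo, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(repo, subfolder="text_encoder")  # CLIP text encoder
unet = UNet2DConditionModel.from_pretrained(repo, subfolder="unet")           # denoising U-Net
vae = AutoencoderKL.from_pretrained(repo, subfolder="vae")                    # VAE encoder/decoder
scheduler = PNDMScheduler.from_pretrained(repo, subfolder="scheduler")        # noise scheduler
```

Keeping the sub-models as separate objects is what allows the intermediate representations (prompt embeddings and latents) to be inspected or modified between stages, which is the purpose of the pipeline described next.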
3.2 Creation of an Object-Oriented Pipeline

The initial stage of the project was devoted to constructing an object-oriented pipeline for generating images from text using the Stable Diffusion model. The design of this pipeline was geared towards enabling convenient access to all intermediate stages of the model's input and output. The pipeline was built around two fundamental objects: "Prompt" and "ImageNoise".

Prompt: The Prompt object encapsulates both the string and the vector representation of the text input. The machine-encoded form of the prompt that guides the Stable Diffusion model is included in this encapsulation. The main reason for creating this object was to provide easy access to, and the ability to manipulate, the prompt during different stages of the pipeline.

ImageNoise: The ImageNoise object represents the latent variable that the Stable Diffusion model operates upon. This latent representation can be a pre-existing image, the output image, or a noise pattern, depending on the specific step in the latent diffusion process. The ImageNoise object was created to provide an efficient way of accessing and manipulating this latent representation during different stages of the pipeline.

Table 3.1: Key objects in the pipeline.

These objects form the backbone of the pipeline, providing a transparent and flexible framework for the project. This structure not only demystifies the inner workings of the Stable Diffusion model but also facilitates granular access to critical stages of the image generation process, supporting a more effective exploration of input-output mappings.

3.3 Input-Output Mapping of the Diffusion Model modules

In order to gain a deeper understanding of the image generation process facilitated by the Stable Diffusion model, an assessment of each module or stage was undertaken. This was accomplished by generating an input-output mapping for each distinct phase of the model's operation.

Input-output mappings serve to illuminate the functionality of each stage, tracing the transformation of data as it passes through a particular module. This provides insight into the module's contribution to the final image generation. These mappings may also hint at potential avenues for improving control of the model.

The primary goal of this exploration is to enhance the understanding of the Stable Diffusion model. Additionally, it could potentially guide targeted improvements in the model, should the mappings suggest a feasible path for such enhancements.

3.3.1 Word Attention

To analyze the model's attention at the word level, the subword-level attention scores were first aggregated into a word-level attention map. This map encapsulates how much attention each word in the prompt receives from the other words, i.e., the token (subword) representation was converted into words:

Word-level Attention Map_{i,j} = CombineTokens(Attention Map)_{i,j}

Here, Attention Map is the original attention matrix from the transformer model, and the indices i and j correspond to the respective words in the input. The word attention was then computed for each word i by summing the attention it receives from all other words, and from itself, across a single head:

Word Attention_i = Σ_j Word-level Attention Map_{i,j}

To gain insight into the total average attention each word receives across all heads in all layers of the transformer model, the mean of the word attention described above was then computed:

Average Word Attention_i = Mean over heads and layers (Word Attention_i)

Finally, these average word attentions were visualized as a bar plot. This representation gives a clearer picture of how much attention each word is given by the model when generating images, which could potentially guide the crafting of more effective prompts.
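A minimal sketch of this aggregation is given below. It computes, for each token position, the average attention that position receives over all heads and layers of the CLIP text encoder; the merging of subword tokens into words (the CombineTokens step above) is reduced to a comment, since it amounts to summing the columns of the subwords that make up a word. The model name follows Table 3.2, and the tensor conventions are those of the Hugging Face transformers library; this is an illustrative sketch rather than the project's exact implementation.

import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

prompt = "a photograph of an astronaut riding a horse"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    out = text_encoder(**inputs, output_attentions=True)

# out.attentions is a tuple with one tensor per layer,
# each of shape (batch, heads, query_tokens, key_tokens).
attn = torch.stack(out.attentions).squeeze(1)   # (layers, heads, seq, seq)

# Attention received by each token: sum over the query dimension,
# then average over heads and layers.
received = attn.sum(dim=-2)                      # (layers, heads, seq)
average_token_attention = received.mean(dim=(0, 1))  # (seq,)

# For word-level scores, the entries of subword tokens belonging to the same
# word would be summed here (the CombineTokens step).
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, score in zip(tokens, average_token_attention):
    print(f"{token:>15s}  {score.item():.3f}")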
3.3.2 Smoothly adjusting the prompt

The U-Net at the core of the diffusion model does not determine the image directly; instead, it is applied iteratively by the scheduler to predict the noise in the image. This introduces complexity when visualizing the effect of the prompt on the image. To circumvent this complexity, the vector representation of the prompt was manipulated before being input into the U-Net. To do this, the vector corresponding to the position of a given word was adjusted. This approach assumes that the CLIP encoder does not change the position of the information connected to a word (which is supported by the results on word attention, see 3.3.1 and 4.1.2). Note that although the primary focus of these adjustments is output control, they also serve as a test of this assumption.

3.3.2.1 Adjustments for control

First, this approach was evaluated as a means for smooth control of the generated image. This was done by scaling the norm of the word's vector representation, for different words and with different percentage changes:

New Word Embedding_i = (1 + ϵ) Prompt Embedding_i

where i is the word index, Prompt Embedding is the vector representation of the prompt indexed by word position, and ϵ is the relative change.

Note that these embeddings can potentially be adjusted in other ways; for example, the encoder often learns to map similar words to similar embeddings, so one might move the vector toward a specific word. This is, however, outside the scope of this project, so scaling the norm is used as a "neutral" choice of adjustment. Table 3.2 shows the hyperparameters used during image generation for this word-embedding adjustment.

Hyperparameter             Value
Number of steps            250
Guidance scale             7.5
Height of image            512
Width of image             512
Stable Diffusion version   CompVis/stable-diffusion-v1-4
CLIP version               openai/clip-vit-base-patch32

Table 3.2: Hyperparameters used during the "Adjustments for control" evaluation.

3.3.2.2 Comparing schedulers

The chosen scheduler also has an effect on the resulting image, as well as on the generation speed. To investigate this, the adjustment described above was repeated with three main types of schedulers: the discrete LMS, DDIM, and PNDM schedulers.
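A minimal sketch of the embedding adjustment described in this section is shown below, under the assumption that the target word maps to a single CLIP token. The scaled embedding is passed to the pipeline through the prompt_embeds argument; in diffusers versions that lack this argument, the same tensor would instead be fed into a hand-written denoising loop, as in the object-oriented pipeline of section 3.2. The variable names and the ϵ value are illustrative.

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
pipe = pipe.to("cuda" if torch.cuda.is_available() else "cpu")

prompt, word, eps = "a portrait of a cyborg in a golden suit, concept art", "suit", 0.2

# Tokenize the prompt with the fixed CLIP length of 77 tokens.
tok = pipe.tokenizer(prompt, padding="max_length",
                     max_length=pipe.tokenizer.model_max_length,
                     return_tensors="pt")
with torch.no_grad():
    embeddings = pipe.text_encoder(tok.input_ids.to(pipe.device))[0]  # (1, 77, 768)

# Locate the position(s) of the word's token in the prompt
# (multi-token words would contribute several positions).
word_id = pipe.tokenizer(word, add_special_tokens=False).input_ids[0]
positions = (tok.input_ids[0] == word_id).nonzero(as_tuple=True)[0]

# New Word Embedding_i = (1 + eps) * Prompt Embedding_i
adjusted = embeddings.clone()
adjusted[0, positions] *= (1.0 + eps)

generator = torch.Generator(device=pipe.device).manual_seed(0)
image = pipe(prompt_embeds=adjusted, num_inference_steps=50,
             guidance_scale=7.5, generator=generator).images[0]
image.save(f"{word}_eps_{eps}.png")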
3.3.3 Prompt-to-Image Mapping

To understand the effect of each word in the prompt on the image output, the smooth prompt representation from above was leveraged to create a prompt-to-image map. This was attempted with a perturbation-based mapping, i.e., changing the prompt by a small amount and mapping the difference, and with a gradient mapping through finite differences, i.e., making a very small change to the prompt and scaling up the result to approximate the gradient.

Settings                   Value
Scheduler                  PNDM
Number of steps            250
Guidance scale             7.5
Height of image            512
Width of image             512
Stable Diffusion version   CompVis/stable-diffusion-v1-4
CLIP version               openai/clip-vit-base-patch32

Table 3.3: Settings used for the diffusion model when computing the images for the prompt-to-image maps.

3.3.3.1 Perturbation Mapping

For the perturbation map, a smooth prompt adjustment was made in the same way as in section 3.3.2, for each word in the prompt, with ϵ = 0.15. The image was then computed using the diffusion module for both the unadjusted and the adjusted prompt, and the difference operation from the ImageChops module of the Python package PIL was used to find the difference between the two images. This operation calculates the absolute value of the pixel-by-pixel difference between the two input images.

To display the result, the "difference" images given by the difference operation were first converted to black-and-white. Each image was then given an individual color using the HSV color wheel, with the alpha channel set to the intensity of the black-and-white image. Lastly, these images were overlaid in two different ways: either by setting each pixel to the color of the "difference" image with the highest intensity at that pixel, or by simply blending all of them together.

3.3.3.2 Gradient Mapping through Finite difference

The gradient was approximated in image space, where an image was represented as a PyTorch tensor of dimensions 3 × image height in pixels × image width in pixels. The change to the input was applied as a small adjustment (as in 3.3.2) with ϵ = 0.01; this can be compared to the ϵ = 0.15 used in the perturbation map (3.3.3.1). An even smaller ϵ was not used because it seemed to make the mapping very noisy.

Approximate Gradient_i = (Adjusted Image_i − Original Image) / ϵ_size

Here, Approximate Gradient_i is the approximate gradient for a given word with index i, Adjusted Image_i is the image generated with the prompt adjusted for this word, and ϵ_size was calculated using the torch.linalg.norm operator in PyTorch.

The approximate gradient was then converted to an image height × image width matrix by calculating the norm of the tensor over the color channels at each pixel location, and the values were rescaled so that the minimum was 0 and the maximum 255, i.e., the standard 8-bit color range. The matrix was then converted into a black-and-white PIL image and displayed in the same way as in section 3.3.3.1.
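The sketch below illustrates the post-processing for the gradient mapping: the difference between the adjusted and unadjusted image is divided by the size of the embedding perturbation, reduced to one value per pixel via the channel norm, rescaled to 0-255, and converted to a grayscale PIL image. The two input images and the two embeddings are assumed to come from generations like those above, and taking ϵ_size as the norm of the embedding change is an assumption about a detail the text leaves open.

import numpy as np
import torch
from PIL import Image

def gradient_map(original_img, adjusted_img, original_emb, adjusted_emb):
    """Approximate per-pixel gradient magnitude for one word via finite differences."""
    # Images as float tensors of shape (3, H, W), values in [0, 255].
    orig = torch.as_tensor(np.array(original_img), dtype=torch.float32).permute(2, 0, 1)
    adj = torch.as_tensor(np.array(adjusted_img), dtype=torch.float32).permute(2, 0, 1)

    # Size of the embedding perturbation that produced the adjusted image.
    eps_size = torch.linalg.norm(adjusted_emb - original_emb).item()

    # Finite-difference approximation of the gradient in image space.
    approx_grad = (adj - orig) / eps_size               # (3, H, W)

    # One value per pixel: the norm over the color channels.
    magnitude = torch.linalg.norm(approx_grad, dim=0)   # (H, W)

    # Rescale to the 8-bit range [0, 255] and convert to a grayscale image.
    magnitude = magnitude - magnitude.min()
    magnitude = 255.0 * magnitude / (magnitude.max() + 1e-12)
    return Image.fromarray(magnitude.numpy().astype(np.uint8), mode="L")

# Example usage with two generated PIL images and the two prompt embeddings:
# gray = gradient_map(img_plain, img_adjusted, emb_plain, emb_adjusted)
# gray.save("gradient_horse.png")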
3.3.4 Noise-to-Image Mapping

The specific image generated by the Stable Diffusion model, and by diffusion models in general, is highly influenced by the exact starting noise, in other words the latent vector that the model starts from, i.e., the noise instance. This is what makes these models "creative". Thus, to investigate input-to-output mappings for every module, the effect of the noise should also be investigated.

Quantifying the effect of the noise given a specific prompt was done relatively straightforwardly by feeding the generated image sample into the CLIP image encoder. A similarity score was then calculated using the CLIP processor accessed through Hugging Face. The similarity score is the scaled dot product between the image embedding and the text embedding. This score is not an absolute similarity measure, but higher values correspond to a more similar prompt and image, and it can thus be used to compare different image samples generated with the same prompt.

Finally, to measure how the "impact of noise" depends on the prompt, a number of images were generated for 5 prompts with an increasing level of relative "abstraction"; for example, "a photograph of an astronaut in space" is less abstract than "a photograph of an astronaut riding an elephant". The similarity score for the prompt and each image sample was then computed and displayed as a moving average with an expanding window, i.e., the first point is the first score itself, the second point is the average of the first and second scores, and so on. The standard deviation of the data was also included in the score plot.

3.4 Proposed method of control: Movement-to-Image

This method is motivated by the instability of the diffusion model, i.e., its sensitivity to noise, and by how well the relatively simple adjustments of the word embeddings worked (see 3.3.2 and 4.1.3). Instead of trying to map segments of the input to the output for the purpose of greater control, a potential method is to map human-understandable points into a space with large control opportunities. That is, similar inputs (for example, prompts) are mapped into the vector embedding space, the dimensionality is reduced, and the input is then moved toward or away from these known points.

Algorithm 1: Movement-to-Image pipeline
1: Generate similar inputs - generate similar prompts, e.g. through an LLM (hard-coded for this proof of concept).
2: Map all inputs to the embedding space - generate the embeddings (done through the CLIP encoder).
3: Reduce dimensionality - create a PCA plot.
4: Move the input in the embedding space - through some joystick-like input, let the user adjust the input embedding (hard-coded for this proof of concept).

This gives a smooth interface for input manipulation, leveraging the other inputs as a sort of "lighthouse" that gives meaning to movements in the embedding space. It effectively circumvents any need to understand the values in the embedding space while still allowing meaningful and potentially predictable smooth adjustments.
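The sketch below shows one way such a pipeline can be put together: a handful of related prompts are embedded with the CLIP text encoder of the pipeline, a two-dimensional PCA of the flattened embeddings serves as the interface, and the user's input point is moved toward one of the "lighthouse" prompts before being mapped back to the embedding space and handed to the diffusion model. The prompt list, the interpolation step, the use of scikit-learn's PCA, and the prompt_embeds hand-off are illustrative choices; the actual proof of concept may differ in these details.

import torch
from sklearn.decomposition import PCA
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
pipe = pipe.to("cuda" if torch.cuda.is_available() else "cpu")

# Step 1: similar inputs (hard-coded here; they could come from an LLM).
prompts = [
    "a picture of a space explorer galloping on a horse",  # the user's input
    "a photograph of an astronaut riding a horse",
    "a painting of a cowboy riding a horse",
    "a photograph of an astronaut in space",
]

# Step 2: map all inputs to the embedding space.
def embed(text):
    ids = pipe.tokenizer(text, padding="max_length",
                         max_length=pipe.tokenizer.model_max_length,
                         return_tensors="pt").input_ids.to(pipe.device)
    with torch.no_grad():
        return pipe.text_encoder(ids)[0].flatten().cpu().numpy()  # (77 * hidden_size,)

embeddings = [embed(p) for p in prompts]

# Step 3: reduce dimensionality; the 2D points form the interface (PCA plot).
pca = PCA(n_components=2)
points = pca.fit_transform(embeddings)

# Step 4: move the user's input (index 0) toward a "lighthouse" prompt (index 1)
# in the 2D interface, then map the moved point back to the embedding space.
alpha = 0.5
moved_point = points[0] + alpha * (points[1] - points[0])
moved = pca.inverse_transform(moved_point.reshape(1, -1))
prompt_embeds = torch.tensor(moved.reshape(1, 77, -1), dtype=torch.float32,
                             device=pipe.device)

# The moved embedding is then handed to the diffusion model in the same way
# as the adjusted embeddings in section 3.3.2.
image = pipe(prompt_embeds=prompt_embeds, num_inference_steps=50,
             guidance_scale=7.5).images[0]
image.save("moved_prompt.png")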
4 Results

This chapter walks through the main results obtained in this project. These include text-impact mapping, which refers to the impact that specific words had on the prompt representation; word-to-image mapping, i.e., the effect of the prompt on the image output; and noise-impact mapping, which relates to the effect of the second input, the noise.

4.1 Text-impact mapping

This section includes different attention measures to determine which words have a relatively large impact on the prompt vector representation, as well as, perhaps more importantly, how the CLIP text encoder shifts and processes the information contained in the prompt.

4.1.1 Standard attention map

We start with a sanity check and a characterization of how the CLIP text encoder handles words in different heads. From figure 4.1 we can see that the information contained in the words seems to stay roughly in the same place, i.e., the attention the model gives to each token seems to be concentrated at the positions occupied by word tokens rather than empty tokens. However, like most methods for evaluating how an ML model makes its decisions, this is a motivated guess rather than a certainty.

Specifically, figure 4.1 shows the amount of attention placed on a token when evaluating another token. The different images in the figure correspond to different heads in different layers for the same prompt.

Figure 4.1: The full attention maps of four heads in different layers, i.e., the attention between all tokens.

Zooming into the tokens that represent words in the prompt "a photograph of an astronaut riding a horse", we can evaluate how much a word impacted another word in a specific layer. We can also see that some words seem to be more important in general and not just for a single head.

Figure 4.2: A zoomed and cropped attention map for one of the heads from the first layer.

Figure 4.3: The attention maps of four heads in different layers; the heads are the same as in figure 4.1.

4.1.2 Average word attention

A more useful way to view the attention maps, when gauging the impact of a word on the prompt vector representation and thus on the image, may be to sum up the attention. This gives a measure of how much the other tokens, and the token itself, are influenced by that token. Figure 4.4 shows this for the prompt "a photograph of an astronaut riding a horse".

Figure 4.4: The attention score of one of the heads from the first layer.

Figure 4.5 shows the attention score averaged over all layers and heads for a few prompts. We can see that the score seems to align with the impact of each word token on the image: word tokens that cannot be displayed in an image, for example "a", tend to have low scores, while nouns that appear in the image have high scores. Thus the average word token score seems to correspond roughly to the effect on the image. It should be noted that this does not apply to all words; for example, "horse" has a relatively low score, although "riding" and "horse" together do have a high score.

Figure 4.5: The average attention score over all heads and all layers.

4.1.3 Word weight adjustment

This section contains results that build on the average word attention results presented above. It does so by working from the assumption that the word tokens do not change position through the CLIP encoder. This is only a working assumption and, at best, an approximation. Using this assumption, the prompt embedding was manipulated for different words. During this process, the effect of the choice of scheduler was also investigated.

4.1.3.1 Discrete LMS Scheduler

Below (figures 4.6, 4.7 and 4.8), we can see that the embedding manipulation seems to correspond well to what would be expected under the assumption. When increasing the "importance" of the word "suit" in the image with the golden cyborg, the body becomes more "suit"-like. When the "importance" of the word "suit" is decreased, the body becomes more cyborg- or robot-like, in other words less "suit"-like. Similar results can be seen for the words "photograph" and "astronaut" for the prompt "a photograph of an astronaut riding a horse". This therefore seems to be a good avenue for increasing control of the model, seemingly allowing the prompt to be adjusted smoothly until it fits what the user was looking for.

Figure 4.6: Image generation results for "a portrait of a cyborg in a golden suit, concept art", adjusted for the word "suit". (a) ϵ = −0.6. (b) ϵ = −0.4. (c) ϵ = −0.2. (d) ϵ = 0.0. (e) ϵ = 0.2. (f) ϵ = 0.4. (g) ϵ = 0.6. (Note that a light Stable Diffusion variant was used for computational efficiency; this does, however, reduce image quality.)

Figure 4.7: Image generation results for "a photograph of an astronaut riding a horse", adjusted for the word "photograph". (a) ϵ = −0.6. (b) ϵ = −0.4. (c) ϵ = −0.2. (d) ϵ = 0.0. (e) ϵ = 0.2. (f) ϵ = 0.4. (g) ϵ = 0.6.
Figure 4.8: Image generation results for "a photograph of an astronaut riding a horse", adjusted for the word "astronaut". (a) ϵ = −0.6. (b) ϵ = −0.4. (c) ϵ = −0.2. (d) ϵ = 0.0. (e) ϵ = 0.2. (f) ϵ = 0.4. (g) ϵ = 0.6.

4.1.3.2 DDIM Scheduler

In the images below, we can see that the difference between LMS and DDIM appears to be very small, at least in the context of prompt adjustment (more images can be found in the appendix).

Figure 4.9: Image generation results for "a portrait of a cyborg in a golden suit, concept art", adjusted for the word "suit" with different epsilon values using the DDIM scheduler. (a) ϵ = −0.6. (b) ϵ = −0.4. (c) ϵ = −0.2. (d) ϵ = 0.0. (e) ϵ = 0.2. (f) ϵ = 0.4. (g) ϵ = 0.6.

4.1.3.3 PNDM Scheduler

The PNDM scheduler seems to perform similarly to LMS and DDIM, at least in this context of prompt adjustment (more images can be found in the appendix).

Figure 4.10: Image generation results for "a portrait of a cyborg in a golden suit, concept art", adjusted for the word "suit" with different epsilon values using the PNDM scheduler. (a) ϵ = −0.6. (b) ϵ = −0.4. (c) ϵ = −0.2. (d) ϵ = 0.0. (e) ϵ = 0.2. (f) ϵ = 0.4. (g) ϵ = 0.6.

4.2 Word-to-Image Mapping

This section includes the results of the word-to-image mappings.

4.2.1 Perturbation Mapping

Figure 4.11 shows the effect of a +15% perturbation of every word in the prompt "a photograph of an astronaut riding a horse". The image shows, at each pixel, the color corresponding to the word with the largest change at that pixel; for example, a pixel is pink if the difference image corresponding to an adjustment of the word "horse" had the largest norm of the color vector at that pixel. We can see that the words seem to be loosely related to the relevant objects in the image. However, one can also see that the words have effects on many "unrelated" parts of the image; for example, the word "astronaut" has large effects on the background.

Figure 4.11: Visualization showing the effect of perturbations on the prompt "a photograph of an astronaut riding a horse"; the perturbation size was +15% in this image.

Figure 4.12 shows the same perturbation-based mapping as figure 4.11. However, here the pixels are simply blended, or added on top of each other, with an alpha value corresponding to the magnitude of change per pixel. The alpha value determines how transparent a color is when a color is defined as [r, g, b, alpha]; thus, the color representing a word with little impact will be nearly transparent and will not have a large effect on the mapping.

Figure 4.12: Visualization showing the effect of perturbations on the prompt "a photograph of an astronaut riding a horse"; the perturbation size was +15% in this image, and the perturbation differences were directly overlaid.

Figure 4.13: Images showing the effect of +15% perturbations for each word in the text prompt "a photograph of an astronaut riding a horse".

4.2.2 Gradient Mapping

Figure 4.14 shows the results of the gradient-based mapping, approximated through the finite difference method, for the prompt "a photograph of an astronaut riding a horse". The image shows the color of the word with the largest effect on each pixel.
In the image, one can see that although there is some structure to the word-to-image mapping, it is much more "random" than the perturbation mapping, i.e., the words have more effect on "unrelated" parts of the image.

Figure 4.14: Visualization constructed in Python showing the "saliency map" of the text prompt's impact on the image output (approximated through the finite difference method).

Figure 4.15 shows the result of the gradient-based mapping where every color is blended on top of the others, with the alpha value corresponding to the effect a word had on a specific pixel. One can see in this image that although there are large effects on unrelated parts of the image, at least some words seem to correspond heavily to the expected part of the image. For example, the word "horse" has most of its effect on the horse in the image.

Figure 4.15: Visualization showing the "saliency map" of the text prompt's impact on the image output; the differences are directly overlaid.

Figure 4.16: The "gradients" in image format, computed through the finite difference method for each word in the text prompt "a photograph of an astronaut riding a horse".

4.3 Noise-impact mapping

Figure 4.17 shows the moving average (with an infinite, i.e., expanding, window), as well as the standard deviation, of the similarity score for 10 different images for each of 5 prompts. One can see that the average similarity score does not seem to correspond to any "structure" in the prompts; that is, all prompts have similar averages even though they were intentionally chosen to vary in how "abstract" they are. For example, the prompt "a photograph of an astronaut in space" is much less abstract than "a photograph of an astronaut riding a horse in the prism". Furthermore, the prompt "a photograph of an astronaut riding a horse in the prism" lies in the middle of some of the more "normal" prompts. This suggests that the diffusion model is able to express the information contained in the CLIP embeddings regardless of the prompt, i.e., the embeddings of the generated images are close to the embeddings of the prompt despite some prompts being presumably harder to visualize.

Figure 4.17: The moving average (the line) of the similarity score, as well as the standard deviation (with an infinite window), computed by the CLIP model.

Looking closer at the standard deviation in figure 4.18, one can see that it might correspond loosely to how "abstract" a prompt is, with the most "normal" prompt at the bottom, although the number of samples is too small to make a firm judgment.

Figure 4.18: The moving standard deviation of the similarity score, computed by the CLIP model.

To evaluate whether the noise-impact mapping illuminates something about the prompt that can be seen in the generated images, a few examples for the prompts are shown in figure 4.19. One can see here that the standard deviation does seem to correspond to a "real" structure in the generated images for the different prompts, i.e., the images are more varied for the prompts with higher standard deviations.

Figure 4.19: Examples of prompt-dependent variation. (a)-(d) Images generated for the prompt "a photograph of an astronaut in space". (e)-(h) Images generated for the prompt "a photograph of an astronaut riding an elephant".
(i)-(l) Images generated for the prompt "a photograph of an astronaut riding a horse in the desert". (m)-(p) Images generated for the prompt "a photograph of an astronaut riding a horse in the prism". (Note that a light Stable Diffusion variant was used for computational efficiency; this does, however, reduce image quality.)

In total, we can thus see that the standard deviation of the similarity score both seems to be expressed in the generated images and may be caused by how "abstract" the prompt is. We can also see that the diffusion model seems to be able to express the information contained in the embeddings, which, together with the varying quality of the images seen in figure 4.19, might suggest that the CLIP model is the "weak link" in the Stable Diffusion model, i.e., lower-quality images may correspond to lower-quality prompt information in the embeddings.

4.4 Movement-to-Image pipeline Proof-of-Concept

Below, the results for the adjusted prompt embedding are shown next to the relevant low-dimensional representation, i.e., the interface. This proof-of-concept approach (see 3.4) seems to work relatively well; that is, manipulation directly in the lower-dimensional representation of the embedding space, which serves as the interface, does seem to correspond to the expected change in the image.

In figures 4.20 and 4.21, we can see the interface as well as the generated image for the input "a picture of a space explorer galloping on a horse" (the red dot).

Figure 4.20: An image showing part of the movement-to-image pipeline proof of concept. The plot is the interface, in this case a PCA plot of the embedding space, and the red dot is the input embedding.

Figure 4.21: An image showing part of the movement-to-image pipeline proof of concept. The image is the generated image corresponding to the interface in figure 4.20 and the prompt "a picture of a space explorer galloping on a horse".

In the next images (4.22 and 4.23), we can see how making the starting input (the red dot), which was "a picture of a space explorer galloping on a horse", more similar to "a photograph of an astronaut riding a horse" actually makes the image include something resembling an "astronaut" while not changing the background much. This suggests a well-behaved interface, as the "astronaut" is the main difference between the two prompts shown in the interface.

Figure 4.22: An image showing part of the movement-to-image pipeline proof of concept. The plot is the interface and the red dot is the adjusted input embedding.

Figure 4.23: An image showing part of the movement-to-image pipeline proof of concept. The image is the generated image from the adjusted embedding corresponding to the interface in figure 4.22. (Note that a light Stable Diffusion variant was used for computational efficiency; this does, however, reduce image quality.)

5 Discussion

This chapter discusses some of the results that seem to be the most significant.

5.1 Prompt consistency through the Encoder layers

To some degree, it is unexpected that the word attention score seems to reflect the amount of importance placed on a word (although, from the results in this report, it is not known whether this assumption is true). Normally, neural networks are not expected to keep the placement of information in a way that reflects the input, and attention maps are not a direct reflection of the network output.
Still, the word-related representations seem to keep the same placement as in the input string: the average word attention (4.1.2) seems to reflect the expected importance of words, and the word weight adjustment (4.1.3) seems to result in the expected change. A possible explanation for this is the residual connections in the transformer architecture of the CLIP encoder. It seems reasonable that the model would not be able to move the word information, as the residual connections would interfere with the moved information.

5.2 Perturbation vs Gradient input-to-output mappings

The most significant characteristic difference between the perturbation- and gradient-based input-to-output maps seems to be the large pixel changes outside the word-specific regions, i.e., the large changes that are not on the horse for the word "horse". This may be because it would be reasonable for the CLIP word embeddings to hold information only on the content of the image, with little or no information on the composition, leaving the diffusion model responsible for all information on the composition of the image.

If we assume this is true, it would be reasonable that an infinitesimal change of the prompt embeddings would not change the content information much, but would serve as a slightly different starting point for the diffusion model. Because the model is so sensitive to the noise (different seeds produce very differently composed images), this slightly different starting point may still result in large changes in composition, i.e., many pixel changes, caused by something like the outlines of the background moving one pixel. This may explain why the larger changes in the perturbation mapping seem to produce the biggest changes in the content.

5.3 Noise-impact mapping and Common sense

An interesting area of continued research may be whether the noise-impact mapping could be improved to the degree that it actually measures how "difficult" it is for the model to construct an image for a given prompt. Given how general these types of models are, this might serve the more general purpose of detecting something related to "how much sense the prompt makes", similar to how autoencoders can be used for anomaly detection.

6 Conclusion

This project foremost aimed to investigate whether input-to-output mappings could be produced for the Stable Diffusion model. This was successful from the viewpoint of the modules, which are of particular interest when it comes to control as they provide many intermediate representations. However, no complete input-to-output mapping for the full diffusion model was produced in this project.

The next question was whether control opportunities could be identified from the input-to-output mappings. In this area, it was found that the CLIP embeddings seemed to be remarkably consistent in structure with the prompt, which led to the possible control opportunity of changing the CLIP embeddings directly (3.3.2). It was also noticed, from comparing perturbation-based embeddings-to-image mappings with finite-difference-based embeddings-to-image mappings, that relatively large (smooth) changes seemed to correspond better to what may be expected compared to small (smooth) changes (4.2).

The last question was whether these insights could be used to construct actual (proof-of-concept) control methods. In this area, smooth word importance adjustments were made and seemed to correspond to what was expected (4.1.3).
A proof of concept for a dimensionality-reduction-based method for general smooth embedding adjustments was also constructed and showed some promise, considering how potentially general it is (4.4).

The proof-of-concept focus inherently means that the project's most useful results may be as a catalyst for ideas in other works. To aid with this, the repository is public on GitHub at https://github.com/philipgrd/ControllingDiffusion.

Bibliography

[1] Rumelhart DE, Hinton GE, Williams RJ. Learning representations by back-propagating errors. Nature. 1986;323(6088):533-6.
[2] Lecun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proceedings of the IEEE. 1998;86(11):2278-324.
[3] Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, et al. Generative adversarial networks. Communications of the ACM. 2020;63(11):139-44.
[4] Rombach R, Blattmann A, Lorenz D, Esser P, Ommer B. High-Resolution Image Synthesis with Latent Diffusion Models. CoRR. 2021;abs/2112.10752. Available from: https://arxiv.org/abs/2112.10752.
[5] Bank D, Koenigstein N, Giryes R. Autoencoders. CoRR. 2020;abs/2003.05991. Available from: https://arxiv.org/abs/2003.05991.
[6] Ribeiro E, Ribeiro R, de Matos DM. A Study on Dialog Act Recognition using Character-Level Tokenization. CoRR. 2018;abs/1805.07231. Available from: http://arxiv.org/abs/1805.07231.
[7] Toraman C, Yilmaz EH, Şahinuç F, Ozcelik O. Impact of Tokenization on Language Models: An Analysis for Turkish. ACM Transactions on Asian and Low-Resource Language Information Processing. 2023 Mar;22(4):1-21. Available from: https://doi.org/10.1145/3578707.
[8] Mielke SJ, Alyafeai Z, Salesky E, Raffel C, Dey M, Gallé M, et al. Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP. CoRR. 2021;abs/2112.10508. Available from: https://arxiv.org/abs/2112.10508.
[9] Almeida F, Xexéo G. Word Embeddings: A Survey. CoRR. 2019;abs/1901.09069. Available from: http://arxiv.org/abs/1901.09069.
[10] Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention Is All You Need. CoRR. 2017;abs/1706.03762. Available from: http://arxiv.org/abs/1706.03762.
[11] Chefer H, Gur S, Wolf L. Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers. CoRR. 2021;abs/2103.15679. Available from: https://arxiv.org/abs/2103.15679.
[12] Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. CoRR. 2020;abs/2010.11929. Available from: https://arxiv.org/abs/2010.11929.
[13] Jamil S, Jalil Piran M, Kwon OJ. A comprehensive survey of transformers for computer vision. Drones. 2023;7(5):287.
[14] Radford A, Kim JW, Hallacy C, Ramesh A, Goh G, Agarwal S, et al. Learning Transferable Visual Models From Natural Language Supervision. CoRR. 2021;abs/2103.00020. Available from: https://arxiv.org/abs/2103.00020.
[15] McAllester D. On the Mathematics of Diffusion Models. arXiv preprint arXiv:2301.11108. 2023.
[16] Chen T. On the importance of noise scheduling for diffusion models. arXiv preprint arXiv:2301.10972. 2023.
[17] Song J, Meng C, Ermon S. Denoising Diffusion Implicit Models. CoRR. 2020;abs/2010.02502. Available from: https://arxiv.org/abs/2010.02502.
[18] Karras T, Aittala M, Aila T, Laine S. Elucidating the design space of diffusion-based generative models. arXiv preprint arXiv:2206.00364. 2022.
[19] Liu L, Ren Y, Lin Z, Zhao Z. Pseudo numerical methods for diffusion models on manifolds. arXiv preprint arXiv:2202.09778. 2022.
[20] Nichol A, Dhariwal P, Ramesh A, Shyam P, Mishkin P, McGrew B, et al. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. CoRR. 2021;abs/2112.10741. Available from: https://arxiv.org/abs/2112.10741.
[21] Kingma DP, Welling M. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. 2013.
[22] Rombach R, Blattmann A, Lorenz D, Esser P, Ommer B. High-Resolution Image Synthesis with Latent Diffusion Models. CoRR. 2021;abs/2112.10752. Available from: https://arxiv.org/abs/2112.10752.
[23] Ronneberger O, Fischer P, Brox T. U-Net: Convolutional Networks for Biomedical Image Segmentation. CoRR. 2015;abs/1505.04597. Available from: http://arxiv.org/abs/1505.04597.
[24] Peebles W, Xie S. Scalable Diffusion Models with Transformers. arXiv preprint arXiv:2212.09748. 2022.
[25] Contributors HFSDM. Pipelines. Hugging Face; 2023. Accessed: 2023-05-29. Available from: https://huggingface.co/docs/diffusers/api/pipelines/overview.
[26] Contributors HFSDM. Schedulers. Hugging Face; 2023. Accessed: 2023-05-29. Available from: https://huggingface.co/docs/diffusers/api/schedulers/overview.

A Appendix 1

Figure A.1: Full attention maps of all heads in the first layer, i.e., the cross attention over all tokens.

Figure A.2: Zoomed and cropped attention maps for all of the heads in the first layer, providing a comparison between heads.

Figure A.3: The attention scores of all heads in the first layer.

Figure A.4: Other image generation results with different epsilon values using the DDIM scheduler. The top two rows show the embeddings adjusted for the word "photograph" and the bottom two rows show the embeddings adjusted for the word "astronaut". For each set of images from left to right: (a) ϵ = −0.6, (b) ϵ = −0.4, (c) ϵ = −0.2, (d) ϵ = 0.0, (e) ϵ = 0.2, (f) ϵ = 0.4, (g) ϵ = 0.6.

Figure A.5: Other image generation results with different epsilon values using the PNDM scheduler. The top two rows show the embeddings adjusted for the word "photograph" and the bottom two rows show the embeddings adjusted for the word "astronaut".
For each set of images from left to right: (a) ϵ = −0.6, (b) ϵ = −0.4, (c) ϵ = −0.2, (d) ϵ = 0.0, (e) ϵ = 0.2, (f) ϵ = 0.4, (g) ϵ = 0.6.