Hybrid Quantum Transformer for Natural Language Processing A hybrid transformer model using a quantum kernel as the attention mechanism Master’s thesis in Complex Adaptive Systems WESLEY CONCEPCION DEPARTMENT OF PHYSICS CHALMERS UNIVERSITY OF TECHNOLOGY Gothenburg, Sweden 2023 www.chalmers.se www.chalmers.se Master’s thesis 2023 QNLP: A Hybrid Quantum Transformer for Sentiment Analysis An evaluation of a quantum kernel self attention mechanism to capture higher order semantic relationships for natural language processing. Wesley Concepcion Department of Physics Chalmers University of Technology Abstract Large language models, based on the transformer architecture, have revolutionized the field of Natural Language Processing by using deep learning techniques to cap- ture complex linguistic patterns. This thesis explores a hybrid architecture for a transformer model for natural language processing. Specifically, a hybrid quantum transformer which incorporates a quantum self-attention mechanism for sentiment classification on the IMDB dataset. The hybrid quantum transformer leverages the strengths of both classical and quantum computing to enhance the performance of the sentiment analysis for natural language processing. The model aims to cap- ture more complex semantic relationships and addresses the potential of quantum computing compared to classical computation. The quantum self-attention module computes the similarity measure between input tokens by first embedding the data in the Hilbert Space of quantum states, followed by an inner product between quan- tum states. Comparative analyses are performed against current work on quantum and classical transformer architectures for sentiment analysis. The quantum ker- nel archictecture is benchmarked against an attention-less transformer and previous implementations for a quantum transformer [27]. Although the quantum kernel transformer performed marginally better in terms of accuracy on the test set over five epochs, the runtime for a simplified quantum kernel transformer significantly highlighted the drawbacks for kernel computations. The long run-time ruled out the feasibility for a more complex set of hyperparameters to match the capabilities of a classic transformer. Future work would explore tuning of the hyperparameters and adding a trainable parameterized layer in the quantum circuit for the kernel to allow for a trained element to the architecture. iv Acknowledgements This thesis project would have not made it very far without the guidance from my advisor, David Fitzek and my supervisor, Mats Granath. I appreciate all of their help and insights throughout the course of the project. I would also like to thank my classmates for support and my opponent Tarek Alhaskir for his valuable questions and interrogations on my presentation and topic. Lastly, I would like to thank my family, and my partner, Anna Rosén for all of their continued support and love they have given. Computations were enabled by resources provided by the National Academic In- frastructure for Supercomputing in Sweden (NAISS) and the Swedish National In- frastructure for Computing (SNIC) at Chalmers Centre for Computational Science and Engineering (C3SE), partially funded by the Swedish Research Council through grant agreements no. 2022-06725 and no. 2018-05973. Wesley Concepcion, Gothenburg, June 2023 vi viii Contents List of Figures xi List of Tables xv 1 Introduction 1 1.1 Aim . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2 Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2 Background 7 2.1 Natural Language Processing . . . . . . . . . . . . . . . . . . . . . . 7 2.1.1 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.2 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.2.1 Kernel Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.3 Quantum Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.3.1 Qubits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.3.2 Qubit Gates . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.3.3 Quantum Data . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.4 Quantum Machine Learning . . . . . . . . . . . . . . . . . . . . . . . 17 2.4.1 Variational Quantum Circuit . . . . . . . . . . . . . . . . . . . 17 2.4.2 Quantum Kernel Methods . . . . . . . . . . . . . . . . . . . . 18 3 Framework 21 3.1 Data Pre-Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 3.2 Classic Transformer . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.2.1 Positional Encoding and Word Embeddings . . . . . . . . . . 27 3.2.2 Encoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 3.2.3 Attention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3.2.4 Feed Forward Network . . . . . . . . . . . . . . . . . . . . . . 29 3.3 Hybrid Transformer . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 3.3.1 Simple Quantum Attention . . . . . . . . . . . . . . . . . . . 29 4 Evaluation 31 4.1 IMDB dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 4.2 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 4.3 Benchmarking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 4.3.1 Previous work . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 4.3.2 Classical transformer . . . . . . . . . . . . . . . . . . . . . . . 34 ix Contents 4.3.3 Variational Quantum Circuit . . . . . . . . . . . . . . . . . . . 36 4.4 Quantum Kernel Attention . . . . . . . . . . . . . . . . . . . . . . . . 38 4.5 Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 5 Conclusion 43 5.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 5.2 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 5.3 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 Bibliography 44 A Appendix 1 I B Appendix 2 III B.1 Classic Transformer . . . . . . . . . . . . . . . . . . . . . . . . . . . . III B.1.1 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . III B.1.2 Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . VII B.2 Variational Quantum Circuit . . . . . . . . . . . . . . . . . . . . . . . X B.2.1 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . X B.2.2 Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . XII B.3 Scaled down Quantum kernel . . . . . . . . . . . . . . . . . . . . . . XIV B.3.1 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . XIV B.3.2 Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . XV B.4 Attention-less Transformer . . . . . . . . . . . . . . . . . . . . . . . . XVII C Appendix 3 XIX C.1 IMDb Movie Review Data Analysis . . . . . . . . . . . . . . . . . . . XIX x List of Figures 1.1 Example diagram for a hybrid machine learning model. In this figure, the classical computer does the pre and post processing of the data, in addition to adjusting the learning rates. The classical computer feeds the quantum computer with an initial state and quantum com- puter performs the measurements after the final state is reached and feeds that measured state back to the classical computer. The final prediction for the given prompt is then given as a result to the user. 2 1.2 Comparing kernel methods mapping to a feature space, and quantum computing mapping to Hilbert space. . . . . . . . . . . . . . . . . . . 3 1.3 A simple block diagram of the full encoder/decoder tranformer. . . . 4 2.1 Example response from ChatGPT (GPT-3). This is an example of question/answer, a sub-category of natural language processing. . . . 7 2.2 Basic binary classifier, which categorizes a two featured input data in to one of two classes. It can be seen that the green linear line separates the two classes and the classes are subsequently labeled as Class 0 or Class 1 depending on where they fall above or below the classification line. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.3 The Bloch sphere is a geometric representation of the pure state space of a two-level quantum mechanical system (qubit). Each point on the surface of the sphere corresponds to a unique quantum state. The north and south poles typically represent the basis states |0〉 and |1〉, respectively, while any other point represents a superposition of these states. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.4 Generic block diagram for a variational circuit. The weights are trained as tunable rotations for each qubit embedding, and we can stack these circuits to increase the capability of the model. . . . . . . 18 2.5 The unitary, Uϕ(x), is composed of the conventional Hadamard, and Uϕ(x), which is diagonal in the Pauli Z-basis. The circuit will act on |0⟩n as the initial state. . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.6 An example 3 qubit circuit diagram for the ZZ Feature map proposed by Havlicek et al. The circuit diagram is a combination of angle encoding and the ZZ-feature map entangling. In this case, only |S| ≤ 2 interactions are considered. . . . . . . . . . . . . . . . . . . . . . . 19 3.1 Architecture for the encoder only transformer . . . . . . . . . . . . . 26 xi List of Figures 3.2 Matrix representation of the QKV calculation from the input embed- dings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3.3 Attention block diagram . . . . . . . . . . . . . . . . . . . . . . . . . 29 4.1 Review length corresponding to sentiment. No significant indication exists whether positive or negative reviews affect the sentiment. . . . 31 4.2 Training of the transformer model with the attention deactivated . . 33 4.3 Full dataset used, sequence length of 512 tokens, and ran for 20 epochs. The runtime was ≈ 43 minutes. . . . . . . . . . . . . . . . . . 35 4.4 First 5000 reviews used, sequence length of 50 tokens, and ran for 5 epochs. The run-time was 3841.8 minutes . . . . . . . . . . . . . . . . 37 4.5 The first 1000 reviews were used. A sequence length of 50 tokens, which was ran for 5 epochs. The runtime of this was 6469 minutes. . 39 4.6 First 1000 reviews used, sequence length of 50 tokens, and ran for 5 epochs. A comparison across all three models tested. The accuracies and runtimes of each model can be seen in the table 4.10 . . . . . . . 40 B.1 Dataset size: 500, Sequence Length: 30, Num Epochs: 5 . . . . . . . III B.2 Dataset size: 1000, Sequence Length: 30, Num Epochs: 5 . . . . . . . IV B.3 Dataset size: 50000, Sequence Length: 30, Num Epochs: 5 . . . . . . IV B.4 Dataset size: 500, Sequence Length: 50, Num Epochs: 5 . . . . . . . V B.5 Dataset size: 5000, Sequence Length: 50, Num Epochs: 5 . . . . . . . V B.6 Dataset size: 50000, Sequence Length: 50, Num Epochs: 5 . . . . . . V B.7 Dataset size: 500, Sequence Length: 512, Num Epochs: 20 . . . . . . VI B.8 Dataset size: 5000, Sequence Length: 512, Num Epochs: 20 . . . . . VI B.9 Dataset size: 50000, Sequence Length: 512, Num Epochs: 20 . . . . . VI B.10 Dataset size: 500, Sequence Length: 30, Num Epochs: 5 . . . . . . . VII B.11 Dataset size: 1000, Sequence Length: 30, Num Epochs: 5 . . . . . . . VII B.12 Dataset size: 50000, Sequence Length: 30, Num Epochs: 5 . . . . . . VII B.13 Dataset size: 500, Sequence Length: 50, Num Epochs: 5 . . . . . . . VIII B.14 Dataset size: 5000, Sequence Length: 50, Num Epochs: 5 . . . . . . . VIII B.15 Dataset size: 50000, Sequence Length: 50, Num Epochs: 5 . . . . . . VIII B.16 Dataset size: 500, Sequence Length: 512, Num Epochs: 20 . . . . . . IX B.17 Dataset size: 5000, Sequence Length: 512, Num Epochs: 20 . . . . . IX B.18 Dataset size: 50000, Sequence Length: 512, Num Epochs: 20 . . . . . IX B.19 Dataset size: 500, Sequence Length: 30, Num Epochs: 5 . . . . . . . X B.20 Dataset size: 1000, Sequence Length: 30, Num Epochs: 5 . . . . . . . X B.21 Dataset size: 500, Sequence Length: 50, Num Epochs: 5 . . . . . . . XI B.22 Dataset size: 1000, Sequence Length: 50, Num Epochs: 5 . . . . . . . XI B.23 Dataset size: 5000, Sequence Length: 50, Num Epochs: 5 . . . . . . . XI B.24 Dataset size: 500, Sequence Length: 30, Num Epochs: 5 . . . . . . . XII B.25 Dataset size: 1000, Sequence Length: 30, Num Epochs: 5 . . . . . . . XII B.26 Dataset size: 500, Sequence Length: 50, Num Epochs: 5 . . . . . . . XII B.27 Dataset size: 1000, Sequence Length: 50, Num Epochs: 5 . . . . . . . XIII B.28 Dataset size: 5000, Sequence Length: 50, Num Epochs: 5 . . . . . . . XIII B.29 Dataset size: 500, Sequence Length: 30, Num Epochs: 5 . . . . . . . XIV B.30 Dataset size: 1000, Sequence Length: 30, Num Epochs: 5 . . . . . . . XIV xii List of Figures B.31 Dataset size: 500, Sequence Length: 50, Num Epochs: 5 . . . . . . . XV B.32 Dataset size: 1000, Sequence Length: 50, Num Epochs: 5 . . . . . . . XV B.33 Dataset size: 500, Sequence Length: 30, Num Epochs: 5 . . . . . . . XV B.34 Dataset size: 1000, Sequence Length: 30, Num Epochs: 5 . . . . . . . XVI B.35 Dataset size: 500, Sequence Length: 50, Num Epochs: 5 . . . . . . . XVI B.36 Dataset size: 500, Sequence Length: 20, Num Epochs: 20 . . . . . . . XVII C.1 Number of reviews for each sentiment . . . . . . . . . . . . . . . . . . XIX C.2 Positve and negative reviews by length . . . . . . . . . . . . . . . . . XX C.3 Samples of reviews . . . . . . . . . . . . . . . . . . . . . . . . . . . . XX xiii List of Figures xiv List of Tables 2.1 NLP tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 3.1 A sample of 5 movie reviews from the IMDB dataset with correspond- ing sentiment. A "1" sentiment indicates a positive sentiment about the given text, and a "0" indicates a negative sentiment. . . . . . . . . 22 3.2 Tokenized input text . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 3.3 Tokenized input text with stopwords removed . . . . . . . . . . . . . 24 3.4 Tokenized input text with punctuation removed . . . . . . . . . . . . 24 3.5 Tokenized input text with lemmatization applied . . . . . . . . . . . 25 3.6 Tokenized text with stemword processing. . . . . . . . . . . . . . . . 25 4.1 Hyperparameters for the attention-less transformer . . . . . . . . . . 33 4.2 Hyperparameters for the previous work on a quantum transformer . . 34 4.3 Hyperparameters for the classical transformer . . . . . . . . . . . . . 35 4.4 Accuracy and Loss results for the Classical Transformer . . . . . . . . 36 4.5 Hyperparameters for the variational quantum attention transformer . 37 4.6 Accuracy and Loss results for the Variational Quantum Circuit At- tention Transformer . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 4.7 Hyperparameters for the Quantum Kernel Attention Transformer . . 38 4.8 Accuracy and Loss results for the Quantum Kernel Attention Trans- former . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 4.9 Parameters used to compare the three model architectures . . . . . . 40 4.10 Comparison of different architectures based on accuracy and runtime after 5 epochs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 xv List of Tables xvi 1 Introduction The development of quantum computing and quantum computers across several platforms and technologies has shown a quantum advantage using quantum simu- lation for simple artificial problems [4]. However, these problems simply show that quantum advantage is achievable, albeit the problems are not of practical impor- tance. Quantum computers today are limited to noisy intermediate-scale quantum (NISQ) devices, which provide limited connectivity and noisy qubits. The imperfections in the quantum hardware, such as temperature fluctuations, electromagnetic interfer- ence, and measurement errors contribute to this noise and can lead to decoherence. Today’s quantum computers only have a number of qubits on the order of 100’s of qubits, and this limited qubit count makes it challenging to perform large-scale problems compared to the vision for certain quantum algorithms, which makes the much anticipated algorithm like Shor’s algorithm out of reach for the time being [21]. Quantum machine learning (QML) is an emerging field at the intersection between quantum computing and classical machine learning aimed at harnessing the unique properties of quantum systems to enhance traditional data analysis techniques. As quantum computers continue to evolve and demonstrate computational power, ex- ploiting their functionality in the realm of machine learning becomes increasingly attractive. Quantum machine learning has the potential to provide an advantage in solving certain problems that classical computers cannot solve in polynomial time. 1 1. Introduction Figure 1.1: Example diagram for a hybrid machine learning model. In this figure, the classical computer does the pre and post processing of the data, in addition to adjusting the learning rates. The classical computer feeds the quantum computer with an initial state and quantum computer performs the measurements after the final state is reached and feeds that measured state back to the classical computer. The final prediction for the given prompt is then given as a result to the user. Hybrid quantum algorithms are in the class of algorithms that combine the power of classical computing and quantum computing to solve a certain problem. The classical computer handles the computations that it can perform efficiently, such as data processing, post-processing of results, and optimization of the algorithm, while the quantum computer focuses on the tasks that can benefit from the unique physical properties of quantum mechanical systems. Figure 1.1 gives a high-level overview of a hybrid model structure. When constructing a hybrid quantum algorithm, one often looks to classical algorithms to find inspiration. As can be seen in figure 1.2, an obvious similarity comparison can be made between quantum computing and kernel methods. Quantum computing and kernel methods have many similarities. As will be explained in this thesis, both methods allow vector computations in feature spaces that are potentially inaccessible otherwise [25]. This can be accomplished without the need to compute an explicit mapping of the vectors. Kernel methods for machine learning are ubiquitous for pattern recognition. One field where such pattern recognition is a necessary tool is in natural language processing (NLP). Large language models (LLMs) are sophisticated models that are trained on a vast amount of text data to acquire a deep understanding of language and can be used to generate human-like responses. These models are typically based on deep learning architectures such as recurrent neural networks (RNN’s) [17] or transformer networks [29]. Transformers are deep learning models that specialize in sequential modeling. A block diagram of the transformer architecture can be 2 1. Introduction Figure 1.2: Comparing kernel methods mapping to a feature space, and quantum computing mapping to Hilbert space. seen in Figure 1.3. The transformer has revolutionized the field of NLP due to their ability to capture long-range dependencies and parallelize computations effectively. Central to transformers is the concept of attention. Attention can be thought of as the relationship between different words in the text, or how each word attends to other words in the text. Given that human language follows patterns regardless of differences in origin, it becomes a good candidate for the hybrid model to detect patterns. We can use this natural language application as a medium to implement a hybrid quantum transformer. 1.1 Aim As alluded to in the introduction, the transformer is a powerful architecture to han- dle natural language processing problems [2] [29]. It has been previously shown that quantum kernel methods have had a big impact on supervised learning tasks and in big data [8] [22] [25]. A natural progression from these two perspectives is to marry the two fields. This brings us to the hypothesis of the thesis. Hypothesis: A hybrid quantum transformer can be implemented for Natural Lan- guage Processing using a quantum kernel as the attention. At the time of writing this thesis, there have been related work implementing a quantum enhanced trans- former for sentiment analysis which uses a variational quantum circuit for attention and as a feed forward network [27], as well as a self-attention for text classification [13], and the use of multiple techniques for the implementation of a quantum vision transformer [3]. To use a quantum kernel as the attention mechanism for the trans- former model to perform NLP Sentiment Analysis is a natural progression from the literature. We can ask many questions in pursuit of this goal. Research Question 1: Can we implement a quantum kernel as the attention mechanism for Transformer models? This research question is structured as a proof-of-concept to demonstrate the feasi- bility of integrating a quantum kernel in place of the attention of an encoder-only 3 1. Introduction Figure 1.3: A simple block diagram of the full encoder/decoder tranformer. 4 1. Introduction transformer. Research Question 2: Can the quantum kernel provide a computational speedup compared to the use of a classic self attention of the transformer model? This is a fundamental question regarding any hybrid quantum machine learning algorithm. If the quantum kernel would provide a computational speedup compared to a classical attention of the transformer model, we would be implying quantum advantage which is a large milestone in the field of quantum computing. Research Question 3: Does the training of a large language model using a quan- tum attention result in higher accuracy compared to a classic attention model? As a follow up to RQ2, alternatively to a computational speed up, we can ask if the training of the hybrid model provides greater accuracy compared to the classical model. This is another evaluation parameter which would indicate quantum advan- tage, and is a fundamental question in the field if hybrid systems can out-perform classical ones. Research Question 4: How does this implementation of a hybrid transformer with quantum kernel attention compare to previous work done using a variational quantum circuit as the attention? At the time of writing this thesis, I am only aware of one other work on a hybrid quantum transformer, notably the work done by DiSipio and colleagues [27]. His work provides a benchmark to compare my implementation of a hybrid quantum transformer. 1.2 Structure This thesis begins with an introduction to the field of natural language processing, machine learning, and quantum computing. Since the scope of the thesis combines these three fields, a fundamental understanding is given as relevant background information to rationalize and effectively interpret the results. The most basic prin- ciples of NLP are given along with necessary definitions as well as the specific tasks that can be solved, and different applications thereof. In order to understand the architecture of a transformer, a basic explanation of how a machine can learn from data and how we can leverage the kernel trick to assist in the learning by way of a feature map. Finally, the principles of quantum computing will be explained be- cause we will need to leverage quantum data encoded in a quantum circuit for our attention mechanism. I will discuss the two quantum algorithms explored in this thesis, the variational quantum circuit, and the quantum kernel estimator. After the required fundamental groundwork has been established, I will provide a framework for how to apply a hybrid quantum transformer with a quantum attention mechanism. Starting from the input dataset, I will navigate the process of how the data is prepared for the model, and then follow what is happening in the model itself, highlighting and explaining relevant features of the transformer and how they work along the way. I will first introduce the dataset, and how we break down the input strings of the dataset such that the inputs are in a more consistent and readable format for the model. In addition to the architecture of the transformer, I will explain how a quantum attention mechanism is incorporated into the architecture of the model. Lastly I will explain the different evaluation metrics and how the 5 1. Introduction model can learn from the data. The evaluation chapter will dive into the performance of the model and how it compares to previous work on transformers for NLP. The first evaluation I will benchmark is a transformer without attention. This architecture without attention is a multi-layer perceptron (MLP) with positional encoding and token embeddings. I will then compare the classical transformer architecture against two quantum vari- ations: the quantum transformer with a variational quantum circuit (VQC) and my implementation of the quantum attention using a quantum kernel. The primary metrics used for comparison are accuracy and loss. Because of the lack of scalability of the quantum models at this stage, we will look into hyper parameter tuning, the size of the dataset used and the impact on performance of the model for evaluating sentiment. By comparing these metrics across the classical transformer, quantum transformer with VQC, and hybrid transformer with quantum attention, we gain in- sights into the relative performance and potential benefits of incorporating quantum elements into the Transformer architecture. Finally, the conclusion will summarize the results. I will also outline the limitations and challenges faced throughout this work. Throughout the thesis, the scope and scale of the project had to be revised given the complexity of the task. In the con- clusion, I will highlight these challenges and future directions that can be explored that will further investigate the potential of hybrid quantum transformers and how we can continue to improve the performance. 6 2 Background Natural Language processing can be thought of as the intersection of computer sci- ence and linguistics. Quantum computing is the intersection of quantum physics, computer science and software engineering. Moving more technical, large language models like BERT, GPT, and their successors are the intersection of machine learn- ing and natural language processing. This section will break down these individual parts and provide the necessary information to understand the implementation of my hybrid quantum transformer. 2.1 Natural Language Processing Natural Language Processing (NLP) is an interdisciplinary sub-field comprised from linguistics, computer science, and artificial intelligence, in which a machine can process, understand, and use the human language. As a generalization, Natural Language Processing can be broken down into two subfields: Natural Language Understanding (NLU), and Natural Language Generation (NLG). NLU focuses on the machine’s ability to understand, process, and interpret the human language, while NLG focus on text generation. In both cases, the machine needs to break down the input text, understand the meaning of the input, and give an appropriate response, whether the response is a sentiment, text summary, translation, etc. Figure 2.1: Example response from ChatGPT (GPT-3). This is an example of question/answer, a sub-category of natural language processing. 7 2. Background NLP has been making waves through the media recently for advancement in Large Language Models (LLMs), which use large datasets to pre-train their models, a computationally expensive task, and deploy the model while continuously fine-tuning based on feedback the model is given by users. One mainstream example of this is ChatGPT, a generative-pre-trained (GPT 4.0) chatbot which OpenAI deployed in March 2023 [19]. An example use case of the chatbot is shown in figure 2.1. NLP can be used to accomplish a variety of tasks concerning the human language. The following are a few examples of common NLP tasks. 8 2. Background Table 2.1: NLP tasks NLP Task Description Example Text Result Text Classifica- tion Categorizing emails as spam or not spam "Congratulations! You’ve won a free trip to Hawaii. Click here to claim your prize!" Spam Sentiment Anal- ysis Determining sen- timent polarity of customer reviews "The movie was fan- tastic. I loved every moment of it!" Positive Named Entity Recognition Identifying names of people, organizations, and locations in a text "John Smith works at Acme Corp in New York City." Person: John Smith, Organi- zation: Acme Corp, Location: New York City Machine Trans- lation Translating a sen- tence from English to French "Hello, how are you?" (English) "Bonjour, com- ment ça va ?" (French) Question An- swering Providing answers based on a given text and questions "Q: What is the capi- tal city of France?" Paris Text Summa- rization Generating a concise summary of a long ar- ticle or document "Text: "The ad- vancements in ar- tificial intelligence and machine learn- ing...transform many aspects of our lives."" ’Artificial in- telligence and machine learn- ing advance- ments have transformed healthcare, autonomous vehicles, finance, and entertain- ment industries, enabling early disease detec- tion, self-driving capabilities, fraud detection, personalized content rec- ommendations, and improved outcomes.’ Speech Recogni- tion Converting spoken language into written text Spoken phrase: "Nav- igate to the nearest coffee shop." "Navigate to the nearest coffee shop." Language Gen- eration Generating human- like text, such as dialogue or stories "Once upon a time, in a land far away..." Once upon a time, in a land far away... 9 2. Background Text Classification Text Classification is the technique in NLP which categorizes a given text into predefined categories or groups based on its context or meaning. Text classification such as spam filtering is one example in which classification of text data is used in our every day lives within email servers. Sentiment Analysis Sentiment Analysis is a part of NLU, in which the goal of the model is to identify and extract subjective information about the input text. These sentiments can be positive or negative, or be assigned a value, for example 1 − 5, with 1 being most negative, and 5 most positive. Companies might use these models for social media monitoring, ranking product reviews, or threat detection. Information Retrieval Used to extract information from a document or text. We can use this for plagiarism scanning etc. Question/Answering Question and Answering is a major subsection in NLP which enables a machine to understand and respond to questions formed in natural language. As an extension to information retrieval, the machine can extract the relevant information from a given set of documents or knowledge base and provide a response to the user input. The question first needs to be analyzed to understand structure, meaning, and the information requested. The machine then summarizes the question, by identifying keywords, and determining which type of question is being posed (factual, opinion- based, list-based, etc). Then the formulated answer needs to be retrieved from the set of documents or knowledge base. This is accomplished via a document search via keyword matching, semantic similarity etc. Summarization Text summarization is the process of collecting a concise and coherent summary of a longer text. This can involve information retrieval to extract the most important information and the main ideas from the original text while maintaining the meaning and context. Summarization can be used when dealing with information overload, which allows the user to quickly grasp the key ideas from a publication or text. Machine Translation Machine Translation is the process of automatically translating from one language to another. There are two main approaches to machine translation. The first is Rule-based Machine Translation (RBMT). Rule based relies on linguistic rules and dictionaries to translate text. The machine would then follow the rules which specify how to transform the sentence structure, words or phrases from the source language to the target language. The other approach is statistical machine translation (SMT). 10 2. Background 2.1.1 Applications Natural language processing can be applied almost anywhere a form of communica- tion is present. Primitive forms of natural language processing came in the forms of Google Assistant, Amazon Alexa, or customer support chatbot with question/an- swer type responses. 2.2 Machine Learning Machine learning in a general sense, is a field of computer science which uses statis- tics to learn from input data, without being explicitly programmed. The given dataset is split into two subsets, a training set and a test set. As can be inferred from the name, a training set is the data in which the machine learning model is trained on, and the test set is the set of data that the model is evaluated. We can subdivide the training in to three major classes of training, supervised, unsupervised and reinforcement learning. In supervised learning, the algorithm learns to map the input data to output data based on specified labels provided in the dataset. The goal is for the model is to learn the correct patterns on the input data which result in an output label, such that it can make accurate predictions on new, unseen data. During the training, the algorithm tries to minimize the error between the predicted output and the true output values. Alternatively, unsupervised learning does not have the available labels to learn from, so the algorithm must learn from underlying patterns or relationships found between the data points. Throughout training, typically only two parameters are of concern. In this thesis, the model will try to minimize the loss and maximize accuracy on a labeled dataset, with the labels being either 1 or 0. If the transformer produces an output that is consistent with the provided label, it will return a 1, and a 0 if the output does not match the label. Further explanation of the evaluation of the model will be described in the following chapter. 2.2.1 Kernel Theory Kernel theory is an important concept in machine learning which allows for analysis of complex functions, particularly non-linear and non-paramaterizable models. To understand the fundamentals of kernel theory, we can revisit a simple problem of binary classification. If we have a set of datapoints, we need to construct a classifier to predict the class of the data points. A simple linear classifier is one such a line in two dimensions or features (x and y), or hyperplane in higher dimensions, can divide the dataset in to two groups, and assigning labels to each of the two groups on opposite sides of the line. A basic linear binary classifier can be visualized in figure 2.2. Algorithms like the support vector machines (SVM) make use of the kernel trick for the computation of the dot product in high dimension feature space, with the goal to maximize the margin of separation between classes of data [23]. The other common kernel algorithm is the principal component analysis (PCA), which can be thought of as distance algorithms for finding norm based distances between 11 2. Background Figure 2.2: Basic binary classifier, which categorizes a two featured input data in to one of two classes. It can be seen that the green linear line separates the two classes and the classes are subsequently labeled as Class 0 or Class 1 depending on where they fall above or below the classification line. inputs in Hilbert space. The kernel function is a similarity measure between the data points in the input space. Classical kernel functions include the linear kernel, polynomial kernel, gaus- sian kernel, and sigmoid kernel, among others [9]. Each kernel function is suitable for different types of data and problems. All kernels however, result in the squared inner product of the data points in a feature space. This mapping to the feature space is performed by a feature map. The kernel trick is a crucial idea in kernel theory which allows for the computation in high-dimensional feature space be cast in terms of dot product, without explicitly computing the feature mapping. This allow for the kernel function (κ) to operate in the input space similar to if they were operating in the high dimensional feature space F . Formally, input data, x, is mapped from the data space, χ, to the feature space, F , using a feature map, ϕ: χ → F , x → ϕ(x) [26]. k(x, x′) = ⟨ϕ(x), ϕ(x′)⟩ (2.1) 12 2. Background 2.3 Quantum Computing In the previous section, we explored the ideas of classical machine learning and ker- nel methods specifically. In this section, we are going to talk about a more nuanced and complex form of computing: quantum computing. Quantum computing is a unique form of computing in which qubits are used instead of bits, and quantum algorithms replace their classical counterparts to perform operations. To explain the fundamentals of quantum computing, we can first begin to under- stand how classical computing works. It is then possible to draw comparisons from which we can base our understanding. Classical computers run on bits (short for binary digit), which can take the form of 0 or 1. Bits represent a logical state of two possible values to store information. These can take the representation of true/false, yes/no, +/−, or simply 0 or 1. Multiple bits can be combined to form bytes, which can represent larger pieces of information, including integers, floating point num- bers, or characters as text strings. Bits can also be used to represent instructions to operate a machine. A computational algorithm is a sequence of these instructions to instruct the computer to perform a specific task by manipulating the states of these bits. Quantum computing can be thought of as the manipulation of quantum infor- mation to perform a specific task by manipulating the quantum states. There are many different approaches to build a quantum computer for quantum information processing, including superconducting qubits [12], trapped ion quantum computing [1], and photonic quantum computing [7], among others. All of which must satisfy DiVincenzo’s criteria for the characteristics of a quantum computer [5]. These have become the main guideline and standard for building a quantum computer. • A scalable physical system with well characterized qubits • The ability to initialize the state of the qubits to a simple fiducial or reference state (such as |000⟩) • Long relevant coherence times, much longer than the gate operation time • A ’universal’ set of quantum gates • A qubit-specific measurement capability 2.3.1 Qubits In quantum computing, information is represented by qubits, or quantum bits. In- stead of representing 0 and 1 like a classical bit, a quantum bit is represented as a linear combination of states, which is often described as a superposition of a ground state |0⟩, and excited state |1⟩, and can be mathematically described by the state ψ as: |ψ⟩ = α |0⟩ + β |1⟩ = ( α β ) (2.2) The numbers α and β are complex numbers. The states |0⟩ and |1⟩ are special states and form an orthonormal basis for the vector space, and are known as the computational basis states [18]. If we have N qubit in our system, the system can be in 2N different states. The state of an N -qubit system can be described by the tensor product of states, for example the 2-qubit state is given as |00⟩ ≡ |0⟩⊗|0⟩. A classical 13 2. Background Figure 2.3: The Bloch sphere is a geometric representation of the pure state space of a two-level quantum mechanical system (qubit). Each point on the surface of the sphere corresponds to a unique quantum state. The north and south poles typically represent the basis states |0〉 and |1〉, respectively, while any other point represents a superposition of these states. system with N bits can be in one of these 2N states, where our quantum system can be in a superposition of all 2N states. When we perform a measurement on the qubit, we cannot explicitly examine the quantum state, rather sample properties of the system. A measurement will collapse the state into either the |0⟩ state with probability |α|2, or |1⟩ state with probability |β|2. Naturally, all of the probabilities must sum to one, i.e. |α|2 + |β|2 = 1. Geometrically, we can rewrite equation 2.2 as: |ψ⟩ = cos ( θ 2 ) |0⟩ + eiφ sin ( θ 2 ) |1⟩ (2.3) The angles θ and φ correspond to angles on the Bloch Sphere which can be visualized in figure 2.3. The operations we perform on the qubits to manipulate the state of the quantum system can be seen as, for a single qubit, rotations around the Bloch Sphere. As we will see in the following section, we can define single qubit gates as specific operations to rotate the qubit around the Bloch sphere. We can the string together sequences of these operations on one or more qubits to form our quantum algorithm. 2.3.2 Qubit Gates The composition of quantum algorithms begins with the formalism of quantum gates. Operations on a single qubit can be described by a 2x2 unitary matrix. The most common are the Pauli matrices: X ≡ ( 0 1 1 0 ) ; Y ≡ ( 0 −i i 0 ) ; Z ≡ ( 1 0 0 −1 ) (2.4) By combining the X and Z gates, we obtain the Hadamard gate, which transforms the qubit state from the |0⟩ , |1⟩ basis to the |+⟩ , |−⟩ basis, and vice versa. H = X + Z√ 2 = 1√ 2 ( 1 1 1 −1 ) (2.5) If we were to make a measurement on the |+⟩ state, we would observe the |0⟩ state with a probability of 1/2, and we would observe the |1⟩ state with a probability of 14 2. Background 1/2. By adding the Hadamard gate to the circuit, we have created a superposition state. In order to do useful things with quantum algorithms which a classical algo- rithm could not, we need to create entanglement or interactions between two qubits through multi-qubit gates. Entanglement can be described as a highly correlated interaction between two particles. Qubits are entangled if the state of the system cannot be described as a product of two individual qubits, but must be considered as a whole [6]. We can apply a controlled operation of a second qubit depending on the state of the first. Typical application is a Controlled-NOT (CNOT or C-X) or controlled-Z (CZ) gate, where we can do nothing to the second qubit if the first in state |0⟩, and apply the unitary to the second if the first is in state |1⟩. CU = ( I2 02 02 U ) (2.6) Where U is the unitary matrix for a single qubit. For example, we can look at a CNOT gate, if the first qubit is in the state |0⟩, (|00⟩ , |01⟩ remains unchanged to |00⟩ , |01⟩), and would apply a NOT (X) gate if the first qubit is in the state |1⟩, (|10⟩ , |11⟩ changes to |11⟩ , |10⟩). In matrix form, this CNOT gate is represented by: CNOT =  1 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0  (2.7) The other two qubit gate that we will pay attention to at this stage is the SWAP gate, which is not a controlled gate, but also creates entanglement. This gate will swap the states of the two qubits: SWAP =  1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1  (2.8) Physical limitations of many quantum computational systems allows for limited nearest neighbor connectivity, which prohibits direct implementations of multi qubit gates of order three or more. However qubit gates of three or more can be decom- posed into single and two-qubit gates [6]. Furthermore, universal quantum compu- tation is achievable if the gate set can approximate any unitary transformation on an arbitrary number of qubits to any desired level of precision [6]. By defining a universal gate set, we can now perform universal quantum computation with which we now can construct quantum algorithms to accomplish a given task. With quan- tum computing still in its infancy, we are left with noisy-intermediate-scale quantum (NISQ) devices. Given current state of quantum devices, quantum algorithms are being developed via simulation in which programs can efficiently simulate quantum systems on classical computers. The Gottesman-Knill theorem states that only a quantum circuit composed of the following elements can be simulated on a classical computer: • Preparation of the qubits in the computational basis 15 2. Background • Gates from the Clifford group (H, CNOT, S) and • Measurements in the computational basis Restricting our quantum circuits to the requirements above, we can make the circuit classically simulatable even though it contains superposition, entanglement, and complex amplitudes [6]. However, in order to work with these systems, we need to work with quantum data. 2.3.3 Quantum Data In order to work with classical data on a quantum computer, it needs to be prepared in a form which can be read by a quantum circuit. There are different approaches to data encoding and each has their respective applications, advantages and disad- vantages. Basis Encoding Basis encoding encodes a classical N -bit string into an N -qubit state. The binary representation of the input x is given as x := bn−1bn−2...b1b0 is encoded into the N - qubit state given by x := |bn−1bn−2...b1b0⟩ [31]. Formally the binary transformation of state x is given by: X ≈ m∑ i=−k bi2i 7→ |bm . . . b−k⟩ (2.9) The state is encoded in a specific basis, typically the computational basis, or the eigenbasis of a specific observable. To encode the data in this basis, one can apply a Pauli-X, or Pauli-Z gates to the qubits. These gates will manipulate the individual qubits and transform the state according to the desired basis encoding defined by the problem at hand. For example, if we want to encode the classical string "10", we need to assign the appropriate probability amplitudes to the basis states corresponding to the classical bits. The basis states of the 2-qubit system are: |00⟩, |01⟩, |10⟩, |11⟩. The classic bit string tells us we want to have the first qubit in the |1⟩ state, and the second qubit in the |0⟩ state. Therefore all other amplitudes (i.e. |00⟩, |01⟩, |11⟩) will be zero. This however identifies a drawback of using basis encoding. We will need a large amount of qubits to represent the data, given the binary representation of numbers. As the values for our vectors grow, the circuit will become very sparse [6]. Amplitude Encoding Amplitude embedding is another way to encode classical data into a quantum state. Every quantum system can be described by a wave function ψ which also defines the measurement probabilities. We use the amplitude of the wave function to encode the data. Therefore the amplitudes of the quantum system are used to represent the data values. The advantage of amplitude encoding is that it only requires log2(n) qubits to encode. For example if we have a vector x = (x0, x1, ..., xn)T , we can 16 2. Background normalize to length 1 and encode using: |ψ⟩ = n−1∑ i=0 xi |i⟩ (2.10) If we have data vectors of different lengths, we need to ensure the lengths of the data to be encoded is a power of 2, so padding to length is necessary. Amplitude encoding seems like a promising and simple way to represent data, however the storage of such data requires QRAM or quantum random access memory, and to date, we are unsure how to build such a memory device [6]. Angle Encoding To map the data to a given angle encoding, we need to normalize the data over the interval [0, 2π). Each value of the vector xi = (x0, x1, ..., xn)T can be encoded using a Rotation gate (R(ϑ)x,y,orz), where the rotation angle allows for unique encoding by a specified angle [30]. We leverage this to encode the data by rotations around the Bloch sphere by a relative phase. The angle encoding is efficient in terms of the number of qubits needed to encode the data. In theory, one could encode a rotational gate to infinite precision, however limitations on the accuracy of the hardware prevent practical implementation of such precision. 2.4 Quantum Machine Learning Having outlined the components of a quantum circuit model to encode data into a quantum circuit, the question now becomes what can we do with classical data in a quantum circuit. Quantum computers have the potential to provide a com- putational speed up compared to their classical counterpart, but are still to early in the development for practical use. Quantum computers are still in the noisy in- termediate scale quantum computer (NISQ) level, which errors and limited qubit connectivity limit application toward complex problems [21]. Quantum computing on the other hand, deals with quantum algorithms, which can be simulated on a classical machine. Quantum algorithms have given rise to quantum machine learn- ing (QML), a sub-field in which a quantum model can learn from data. Machine learning and quantum computing have the potential to solve previously untenable problems. Of these algorithms, this thesis will explore the application of Variational Quantum Algorithms (VQA), and Quantum Kernel methods. 2.4.1 Variational Quantum Circuit The general concept of a variational quantum circuit (VQC) is to vary the parame- ters of the circuit and iteratively vary the parameters of the circuit until an objective function has been minimized such as the mean square error (MSE) between the out- put and the labels. VQA’s are a class of hybrid quantum models which the quantum computer or simulator will encode the data into a variational circuit with a certain set of parameters and efficiently estimate the cost function (or it’s gradients). The weights are trained as tunable rotations (θ) for each qubit embedding. Schuld et 17 2. Background Figure 2.4: Generic block diagram for a variational circuit. The weights are trained as tunable rotations for each qubit embedding, and we can stack these circuits to increase the capability of the model. al. proposed these circuits with multiple layers to increase the model’s capabilities [25]. This information is then fed into a classical optimizer, and the optimizer will navigate the cost landscape and solve the optimization problem. Once a termination condition is set, the VQA will output the estimate of the solution. 2.4.2 Quantum Kernel Methods As mentioned in section 2.1.1 on kernel methods, the kernel trick allows for high dimensional data to be computed in the lower dimensional space. We can use these fundamentals in the quantum space as well. The states of the quantum systems are vectors in high dimensional Hilbert Space. The Hilbert space of an n-qubit quantum computer is the vector space CN where N = 2n. The quantum embedding kernel is simply the overlap of the two quantum states in the Hilbert space. One way to evaluate relationships between data is to evaluate the similarity between the two datapoints. Conveniently, the inner product of two quantum states is just that. κ (x, x′) = 〈 0 . . . 0 ∣∣∣Vϕ(x)V † ϕ (x′) ∣∣∣ 0 . . . 0〉 (2.11) The benefit of kernel methods is to provide a feature map from an high dimensional Hilbert Space to a reduced dimensional feature space for which we can linearly classify data. As discussed in the introduction, a quantum kernel can be used to provide a com- putational speedup compared to a classical kernel. However, we need to implement a feature map that is hard to simulate classically. Havlicek et al. describe a feature map (figure 2.5) generated by the unitary, Uϕ(x), of such a mapping on n-qubits[8]. Uϕ(x) = Uϕ(x)H ⊗nUϕ(x)H ⊗n Uϕ(x) = exp i ∑ S⊆[n] ϕS(x) ∏ i∈S Zi  (2.12) This feature map is known as the ZZ-feature map. This feature map is named after the Pauli-Z tensor product of quantum states which embeds classical data into the quantum feature space. The qubits are encoded with a Hadamard and the rotational z-gate, and are entangled in the section with the ZZ gates. The circuit diagram for the ZZ feature map can be seen in figure 2.6. 18 2. Background Figure 2.5: The unitary, Uϕ(x), is composed of the conventional Hadamard, and Uϕ(x), which is diagonal in the Pauli Z-basis. The circuit will act on |0⟩n as the initial state. Figure 2.6: An example 3 qubit circuit diagram for the ZZ Feature map proposed by Havlicek et al. The circuit diagram is a combination of angle encoding and the ZZ-feature map entangling. In this case, only |S| ≤ 2 interactions are considered. Once the classical data is mapped to the quantum space, we can then perform the inner product of these states to get a similarity measure between data points. K (x, y) = |⟨ϕ(x)|ϕ(y)⟩|2 (2.13) By embedding the data into a quantum feature space, we can use a parameterized quantum circuit to map the data point into the quantum state [10]. The kernel value is the overlap of the parameterized quantum states. The mutual kernel values from all data-points in the set form the kernel matrix. In our case, the constructed hyper- plane to classify the data is purely classical. Only the kernel function is evaluated on the quantum computer. 19 2. Background 20 3 Framework With the background information necessary for understanding the context of this thesis, I will now elaborate how the research questions will be addressed. I will begin by explaining the dataset, and how it’s used for sentiment analysis, and then dissect the transformer both classically and how it’s migrated to a hybrid architecture. The overall architecture can be found in figure 1.3, which was first described by Vaswani et al in 2017. Due to the fact that most machine learning libraries are written in Python, along with previous work [27] [24], the natural choice of language to navigate this thesis was Python. To manage the loading and manipulation of the dataset, Pytorch, an open source Deep Learning framework for building and training neural network architectures, was used for tensor computations and the handling of gradient based training. To accomplish the Natural Language Processing task of Sentiment Analysis, we are using the IMDB movie review dataset for this task. The dataset is composed of 50, 000 highly polar movie reviews, compiled from the Internet Movie Database (IMDB) along with their associated sentiment, either positive or negative. This dataset was compiled by Andrew Mass et al at Stanford [15]. Because the actual movie reviews are considerably long (> 500 words), five different simplified text examples can be seen in the table 3.1. The written review is formatted as an input string in English text form. The labels of the dataset are binary 0, for a negative review, and 1 for a positive review, and are listed under the heading "Sentiment" in the table. 21 3. Framework Table 3.1: A sample of 5 movie reviews from the IMDB dataset with corresponding sentiment. A "1" sentiment indicates a positive sentiment about the given text, and a "0" indicates a negative sentiment. Review Sentiment "This movie was absolutely fantastic! The acting was superb, and the plot kept me on the edge of my seat throughout." 1 "I found the movie to be quite disappointing. The story was con- fusing, and the acting was subpar." 0 "A heartwarming film that left me with a smile on my face. The characters were well-developed, and the cinematography was stun- ning." 1 "I couldn’t stand this movie. The dialogue was poorly written, and the performances were uninspiring." 0 "This is a must-see movie! The plot twists were mind-blowing, and the acting was top-notch." 1 We will be training our dataset on the movie reviews with known labels and testing on new reviews without labels to predict sentiment of the text. The output of the model can be deployed in a variety of areas as talked about in the background section. 3.1 Data Pre-Processing As with all machine learning tasks, we need to prepare the input data such that is in a clean and readable format for the model. In addition to checking the datatset for null values, incomplete data, entry errors, etc, we use a variety of pre-processing steps which are specific for sentiment analysis string format data. Before we can pre-process the text, we first must convert the input text to tokens. Tokenization helps to split the text into smaller units, which allows for easier analysis and pro- cessing. Common tokenization methods include splitting the text on whitespace, punctuation marks, or a more advanced method with more complex language mod- els involves splitting on word segments. Continuing with the sample data we can visualize how we split the review text into tokens. 22 3. Framework Table 3.2: Tokenized input text Review Sentiment "["This", "movie", "was", "absolutely", "fantastic", "!", "The", "act- ing", "was", "superb", ",", "and", "the", "plot", "kept", "me", "on", "the", "edge", "of", "my", "seat", "throughout", "."]." 1 ["I", "found", "the", "movie", "to", "be", "quite", "disappointing", ".", "The", "story", "was", "confusing", ",", "and", "the", "acting", "was", "subpar", "."] 0 ["A", "heartwarming", "film", "that", "left", "me", "with", "a", "smile", "on", "my", "face", ".", "The", "characters", "were", "well-developed", ",", "and", "the", "cinematography", "was", "stunning", "."] 1 "I couldn’t stand this movie. The dialogue was poorly written, and the performances were uninspiring."["I", "couldn’t", "stand", "this", "movie", ".", "The", "dialogue", "was", "poorly", "written", ",", "and", "the", "performances", "were", "uninspiring", "."] 0 ["This", "is", "a", "must-see", "movie", "!", "The", "plot", "twists", "were", "mind-blowing", ",", "and", "the", "acting", "was", "top- notch", "."] 1 The first text pre-processing step is to remove numbers. As we do not expect numbers to pay a significant role in the overall sentiment of the text, we can simply remove them. As there are no numbers in the previous examples, a sample text which the function to remove number is applied can be seen as: "There are 3 balls in the bag, and 12 in the other one" which becomes: "There are balls in the bag, and in the other one". This simply removes the inputs in the string which are not part of English alphabet. After processing this for removing numbers, we need to rejoin the string such that there are no extra spaces. Stopwords are commonly used words in a language that do not carry significant meaning or sentiment. Examples of stopwords in English include "the," "is," "and," "but," and "in." These words can occur frequently in a text document, but they may not contribute much to the sentiment expressed in the text. By removing stopwords, we can reduce noise and focus on the more meaningful words that carry sentiment information. 23 3. Framework Table 3.3: Tokenized input text with stopwords removed Review Sentiment ["movie", "absolutely", "fantastic", "!", "acting", "superb", ",", "plot", "kept", "edge", "seat", "throughout", "."] 1 ["found", "movie", "quite", "disappointing", ".", "story", "confusing", ",", "acting", "subpar", "."] 0 ["heartwarming", "film", "left", "smile", "face", ".", "characters", "well-developed", ",", "cinematography", "stunning", "."] 1 ["couldn’t", "stand", "movie", ".", "dialogue", "poorly", "written", ",", "performances", "uninspiring", "."] 0 ["must-see", "movie", "!", "plot", "twists", "mind-blowing", ",", "act- ing", "top-notch", "."] 1 As we look through the examples, it can be seen that the meaning and overall sentiment of the review remains. Punctuation marks such as periods, question marks, and exclamation marks help de- fine the structure of a sentence. They indicate the beginning and end of a sentence, and they help differentiate between declarative statements, interrogative sentences, and exclamatory expressions. Without proper punctuation, sentences can become ambiguous or difficult to understand. Although punctuation provides structure to our sentences and can help clarify relation between text, the primary reason to re- move punctuation is to focus on the words themselves and concentrate on the lexical content and sentiment of the words. By removing punctuation, the text becomes a sequence of words without the interruptions or breaks introduced by punctuation marks. This streamlined text allows our model to focus on the sentiment of the words, their relationships, and their overall impact on the classification. Table 3.4: Tokenized input text with punctuation removed Review Sentiment ["movie", "absolutely", "fantastic", "acting", "superb", "plot", "kept", "edge", "seat", "throughout"] 1 ["found", "movie", "quite", "disappointing", "story", "confusing", "acting", "subpar"] 0 ["heartwarming", "film", "left", "smile", "face", "characters", "well- developed", "cinematography", "stunning"] 1 ["couldn’t", "stand", "movie", "dialogue", "poorly", "written", "per- formances", "uninspiring"] 0 ["must-see", "movie", "plot", "twists", "mind-blowing", "acting", "top-notch"] 1 Lemmatization is the process of reducing words to their dictionary form, or lemma. The Natural Language Tool Kit (NLTK) is a python library for Natural language processing tasks, including lemmatization. Lemmatization first employs another NLP tool, Part-of-Speech (POS) tagging, to determine the grammatical category of 24 3. Framework each word in the text (noun, verb, adjective, etc). The lemmatization tool then can use this POS tag and returns the lemma of the word. The lemmatization is provided by WordNet [28], a lexical database in English which groups each grammatical cat- egory into sets of cognitive synonyms (synsets). Table 3.5: Tokenized input text with lemmatization applied Review Sentiment [’movie’, ’absolutely’, ’fantastic’, ’act’, ’superb’, ’plot’, ’keep’, ’edge’, ’seat’, ’throughout’] 1 [’find’, ’movie’, ’quite’, ’disappoint’, ’story’, ’confuse’, ’act’, ’sub- par’] 0 [’heartwarming’, ’film’, ’leave’, ’smile’, ’face’, ’character’, ’well- developed’, ’cinematography’, ’stun’] 1 [’could’, "n’t", ’stand’, ’movie’, ’dialogue’, ’poorly’, ’write’, ’perfor- mances’, ’uninspiring’] 0 [’must-see’, ’movie’, ’plot’, ’twist’, ’mind-blowing’, ’act’, ’top- notch’] 1 For example, we see that the words "written" and "acting" were changed to "write" and "act", thus simplifying each word. Lemmatization ensures that different inflec- tions of a word are mapped to a single form. Another method to reduce words into a simpler form is called stemming. Stemming is a faster, but less accurate way to truncate the word into a simpler form without considering meaning or context. The truncated words might not always be a valid form of the word, but the processing time of finding the stemwords is significantly faster than performing lemmatization. To begin, the model used for this project used stemwords for efficiency, as the processing time for providing the lemmas for all 30k input reviews in the training set was 1.7s, whereas the processing time for stemming was 0.5s. Table 3.6: Tokenized text with stemword processing. Review Sentiment [’movi’, ’absolut’, ’fantast’, ’act’, ’superb’, ’plot’, ’kept’, ’edg’, ’seat’, ’throughout’] 1 [’found’, ’movi’, ’quit’, ’disappoint’, ’stori’, ’confus’, ’act’, ’subpar’] 0 [’heartwarm’, ’film’, ’left’, ’smile’, ’face’, ’charact’, ’well- develop’, ’cinematographi’, ’stun’] 1 [’could’, "n’t", ’stand’, ’movi’, ’dialogu’, ’poorli’, ’writ- ten’, ’perform’, ’uninspir’] 0 [’must-se’, ’movi’, ’plot’, ’twist’, ’mind-blow’, ’act’, ’top- notch’] 1 25 3. Framework Figure 3.1: Architecture for the encoder only transformer We can see a number of inaccuracies in the words produced after stemming. As we will see, these words will be mapped to a dictionary where the model will learn the meaning of these words and their relations to each other. So the relationship between stemwords should not be affected greatly by the stemming. However, if we provide further processing of the text by lemmatization, we can expect to see improvement in the model performance. After the input tokens have been processed, they need to be stored in a dictionary, which we call the open vocabulary dictionary. This vocabulary contains all of the unique tokens that appear in the training set. The size of this dictionary is a hyperparameter to be explored, but for the purposes of this thesis, we will fix the size to be 20, 000 unique tokens. 3.2 Classic Transformer Transformers have revolutionized the field of natural language processing and were introduced in 2017 in the paper "Attention is all you need" [29]. Most modern NLP models are based on a transformer architecture, and have been applied to a wide range of NLP tasks. A survey of such use cases was written by Yang et al. [33]. The architecture for the encoder only transformer, as first introduced in [29] is shown in figure 3.1, contrary to the encoder/decoder transformer shown in the introduction. Prior to the transformer, recurrent models and convolutional neural networks pro- cessed input sequences in sequential manner, or word by word. As sequence lengths 26 3. Framework get longer, sequential models might lose some long-range dependency. In other words, if there is a word at the end of the sequence which relies on the input at the beginning of the sequence, the model might forget which word was at the be- ginning by the time it gets to the later word. Transformers on the other hand, have the possibility to process the entire sentence in parallel, making them much more efficient. The transformer is a sequence agnostic model, in which relations between tokens is handled by the Multi-Head Attention as will be described in section 3.2.3. This Multi Headed attention is packaged in what is called the Encoder and Decoder stacks. This architecture has been used in recurrent models, however the Attention is where the main differences lie. The encoder will process the input text, while the decoder will produce the output text. The following sections will break down the components of the transformer in more detail. 3.2.1 Positional Encoding and Word Embeddings To capture the meaning of the words, the individual tokens are vectorized. The word vectors are then assigned to a random vector with the size of the embedding dimension. The embedding dimension is a hyper-parameter which can be optimized. During the training process, the embeddings are updated based on the objective of the model and patterns in the data. How one initially embeds the data can be random, by pretrained word embeddings such as Word2Vec [16], GloVe [20], or fastText [11], or by contextual word embeddings which not only capture the meaning of the individual words, but also their context within the sentence or document. For simplicity, our model will use randomly assigned embedding vectors. In sequential data, such as input text, the order of the words conveys important information. Transformers are order agnostic, as they process the entire sequence in parallel, so the concept of word order is lost. To combat this, Vaswani et al proposed a clever solution to incorporate sequential position information into the transformer. They proposed a set of learned sinusoidal functions to encode the position information. The positional encoding is then added to the word embeddings. PE(pos,2i) = sin (pos/100002i/dmodel) (3.1) PE(pos,2i+1) = cos (pos/100002i/dmodel) (3.2) where pos is the position in the sequence, and i is the dimension. By using a large enough dimension (dmodel), the positional encoding can distinguish between closely spaced positions effectively [29]. 3.2.2 Encoder The embedded text information is then passed on to the encoder stack. This stack is composed of two sub-layers which have a residual connection followed by a layer normalization. The two main sublayers are a Multi-Headed Attention layer, and a simple fully connected feed forward neural network. For NLG tasks, transformer architecture employs the use of both an encoder and decoder stack, however for sentiment analysis, we do not need to utilize the future position dependent output, so we will only use an encoder layer for the attention. To increase the complexity of 27 3. Framework Figure 3.2: Matrix representation of the QKV calculation from the input embed- dings. the transformer, and to pick up new information after each pass, the encoder (and decoder layer if applicable), can be repeated N number of times. 3.2.3 Attention At a high level, the attention can be understood as a similarity measure between elements of the input sequence, where the weights or similarities are determined by weighted combinations of the elements. The attention can be broken down even further to the query, key, and value matrices. The keys (K) is computed by a linear transformation of the word embeddings. The key captures the knowledge or information of each word and it’s relevance to the other words in the sequence. The query (Q) vector is another linear transformation of the word embeddings. The query vector represents the word for which the attention scores are being calculated. The value matrix carries the actual information of each word and is obtained by another linear transformation of the word embeddings. These vectors are obtained by multiplying the input vector by a weight matrix W , with the corresponding subscript WQ,WK ,WV . Vaswani et al described this approach as the scaled dot product attention, during which the query vector of a word is compared with the key vectors of all other words in the sequence to determine their similarity or relevance. The dot product of the two vectors captures the similarity between the two. For high dimensionality, the dot products can grow very large in magnitude, so the dot product is thus scaled by 1√ dk . Attention(Q,K, V ) = softmax(QK T √ dk )V (3.3) Multi-Headed Attention therefore is simply the self attention mechanism performed multiple times in parallel. This parallel attention allows the model to capture dif- ferent kinds of relationships. Each attention head attends to different parts of the 28 3. Framework Figure 3.3: Attention block diagram input sequence which allows for various dependencies to be captured at different levels of granularity. Each of the projected heads is then concatenated and a final output with dimension of dv is projected on to the feed forward network. 3.2.4 Feed Forward Network The classical feed forward network consists of two linear transformations with non- linear activation functions in between, typically Rectified Linear Unit (ReLU) acti- vation. The inner layer has a hidden dimension of FFN(x) = max(0, xW1 + b1)W2 + b2 (3.4) Where W1 and W2 are learned weights. The output of this layer has the same dimensionality of the model, which is then given to a final Linear transformation and softmax function to give the final output classification. 3.3 Hybrid Transformer The hybrid transformer is a member of the class of algorithms which are deemed hy- brid quantum algorithms. These algortihms leverage the classical optimizers ability to pre and post-process the data, perform the optimization over the model param- eters, and defines the quantum circuit. 3.3.1 Simple Quantum Attention Quantum attention refers to the quantum kernel embedding in the Hilbert Space. We take the tokens for each of the sequences, and perform quantum attention on 29 3. Framework the relation to all other words in the sequence. This attention, provides the inner product to provide a similarity measure between tokens. This similarity measure, is then passed on to the feed forward network to provide the trainable weights for learning the representation. We can think of attention as how well each word ’attends’ to every other word in the sequence. For this thesis, the implementation of the attention is a scaled down version, in which we use the kernel function to measure the pair-wise similarity between tokens in an input text. As mentioned in the background, I will begin by using the kernel that is known as the ZZ-Feature map. The choice of kernel is another parameter that can be tuned throughout the project. Because the operation is computationally expensive, (O(L2), where L is sequence length), and the current pennylane framework is not compatabile with batching over inputs, the initial architecture for computing the kernel attention is limited to a single evaluation of the kernel matrix for each input sequence. 30 4 Evaluation With the algorithmic framework set, we now have the ability to answer the original research question. The evaluation of the hybrid transformer compared to the clas- sical transformer will be compared using the IMDB dataset for training. To further understand the power of the transformer framework and the power of attention, the transformer will also be analyzed with the attention layer deactivated. 4.1 IMDB dataset A full evaluation of the dataset can be seen in Appendix B. To not introduce a bias in the dataset for length of review and how that influences the sentiment, all of the reviews were truncated or padded to a specified length. In addition, the maximum number of tokens that our model has a limit of 512 tokens. Simple exploratory analysis of the dataset was done to visualize any predefined patterns in the dataset when it comes to sentiment. The average length review across the dataset was around 200 words as can be seen in figure 4.1. Figure 4.1: Review length corresponding to sentiment. No significant indication exists whether positive or negative reviews affect the sentiment. 4.2 Evaluation Metrics When evaluating a machine learning model, the common evaluation metrics are accuracy and loss. In short, the accuracy measure for a classification problem is a 31 4. Evaluation measure of correct predictions over the total number of predictions. This can be formalized by the equation: Accuracy = TP + TN TP + TN + FP + FN (4.1) where TP and TN are true positives and true negatives respectively, and FP and FN are false positives and false negatives respectively. For sentiment analysis, the true positives would be cases where the model correctly predicted the positive sentiment of the text as indicated by the label, and true negative would be the case where the model correctly predicted the negative sentiment of the text. False positive and false negatives would then be incorrect predictions of the text. Before the accuracy can be measured, the model must compute the loss function as seen in equation 4.2. Combined with an activation function to transform a raw probabilistic prediction to a binary classification, the Binary Cross Entropy (BCE) with logits (raw, un-normalized predictions) loss is a PyTorch built in loss function for deep learning models. Where the binary cross entropy calculates the loss by measuring the difference between the predicted probabilities and the true labels. ℓ(x, y) = L = {l1, . . . , lN}⊤ , ln = −wn [yn · log σ (xn) + (1 − yn) · log (1 − σ (xn))] (4.2) The classical optimizer Adam is used to to update the weights of the transformer. After each iteration, the optimizer computes the gradients with respect to the loss function. The parameters are updated according to the learning rate using a com- bination of the first and second moment of the gradients. This can be thought of as an adaptive learning rate, where parameters with a large gradients will not change the learning rate as much and those with small gradients will increase the learning rate. Due to ease of use, no other optimizer or loss function will be explored for this thesis. During the training, to avoid overfitting the data, we use a dropout after each sublayer of the model. The idea behind dropout is to introduce some noise or randomness into the deep network by temporariliy removing a subset of neurons during the training. This prevents the network from relying too heavily on specific neurons. [14] 4.3 Benchmarking The power of transformers arises from this concept of attention. However, a tra- ditional multi-layer perceptron is still a powerful tool for classification. Assessing the performance of the attention mechanism in a transformer model is crucial for understanding its impact on the model’s overall performance. This evaluation can be conducted by comparing the model’s performance with the attention mechanism deactivated and then activating it. When the attention mechanism is deactivated, the transformer model loses its ability to capture relationships between different tokens or elements in the input sequence. In this scenario, the model relies solely on the feed-forward layers, which process each token independently. The hyperpa- rameters used for this assessment can be seen in table 4.1. The model was trained over 20 epochs, with a learning rate of 0.05. The results of the training and testing evaluation can be seen in following figure 4.2. 32 4. Evaluation Figure 4.2: Training of the transformer model with the attention deactivated Table 4.1: Hyperparameters for the attention-less transformer Parameter Value Size of dataset 500 Batch size 16 Sequence Length 50 Embedding Dimension 8 Feed Forward Dimension 64 Number Epochs 20 Number of Transformer blocks 0 Number of heads 1 Vocab size 20,000 Dropout 0.1 At the beginning of the training, the model’s performance shows a significant im- provement as it learns from the data. This is evident by observing an upward trend in the accuracy metric in the graph. As the training continues, the performance gains gradually diminish, and the graph starts to level out. This indicates that the model is reaching a point of saturation where additional training epochs may not yield significant improvements. The leveling out of the graph suggests that the model has learned most of the patterns and information from the training data. It implies that the model is now adequately trained and can generalize well to unseen data. By observing this graph, one can conclude that training the model for more than 20 epochs may not provide substantial benefits in terms of performance im- provement. However, it can be concluded that a simple multi-layer perceptron does provide a good framework for learning based on the performance. 33 4. Evaluation 4.3.1 Previous work We will now look at the performance of a classical transformer. The classical trans- former has been used for sentiment analysis and there are many implementations of this [32]. Because I am not competing to be the best classical transformer, a simple implementation would suffice for the purposes of this thesis as a benchmark. Also because the classical transformer is not the main focus of the paper, an ex- haustive hyperparameter tuning was not performed. The hyperparameters used as comparison to the Di Sipio paper [27] can be seen in table 4.2. Table 4.2: Hyperparameters for the previous work on a quantum transformer Parameter Value Size of dataset 50000 Batch size 32 Sequence Length 64 Embedding Dimension 8 Feed Forward Dimension 8 Number Epochs 1 Number of Transformer blocks 1 Number of heads 2 Vocab size 20,000 Dropout 0.1 When running the hybrid transformer using the hyperparameters outlined in the paper [27], the runtime for such a model was too long to train on the current hardware and my implementation. Therefore, it was deemed necessary to modify the hyperparameters such that a feasible runtime could be achieved to obtain results. From this point, I started to modify the classical transformer and build upon this architecture. 4.3.2 Classical transformer The classic transformer with an encoder-only architecture was run using the full 50k review dataset. Where 30k reviews were used for the training set and 20k reviews were used for the test set. The results for the best performing classic transformer can be seen in the figure 4.3 with the hyperparameters in 4.3. 34 4. Evaluation Table 4.3: Hyperparameters for the classical transformer Parameter Value Size of dataset 50000 Batch size 32 Sequence Length 512 Embedding Dimension 64 Feed Forward Dimension 256 Number Epochs 10 Number of Transformer blocks 4 Number of heads 8 Vocab size 20,000 Dropout 0.1 Figure 4.3: Full dataset used, sequence length of 512 tokens, and ran for 20 epochs. The runtime was ≈ 43 minutes. 35 4. Evaluation Table 4.4: Accuracy and Loss results for the Classical Transformer Epoch Training Loss Training Accuracy (%) Testing Accuracy (%) Epoch Runtime (s) 0 0.560 70.02 80.14 132.4 1 0.414 81.15 84.13 117.7 2 0.345 85.28 85.65 118.5 3 0.347 85.16 85.54 117.1 4 0.311 86.93 87.18 118.4 5 0.245 90.29 88.17 119.2 6 0.222 91.54 88.81 116.2 7 0.203 92.45 89.37 116.5 8 0.184 93.34 89.80 117.9 9 0.171 93.97 89.99 117.1 10 0.157 94.50 90.13 117.1 18 0.142 95.35 90.41 118.1 19 0.144 95.20 90.43 118.5 Total Runtime: 43m, 3s It is clear from the test and training data that this model performs well. However, comparing to published classical transformers analyzed on the same dataset, it falls short by about 2% from making the leaderboard [32]. After about 10 epochs, the training seems to slow down as it reaches a plateau at around 90% on testing accu- racy. We can see a dip in the accuracy after the 2nd and 3rd epoch, but the model quickly improves after. 4.3.3 Variational Quantum Circuit The variational circuit was the best performing architecture for the quantum atten- tion due to the availble tuning of the hyperparameters. This increase in performance is likely linked to the bigger size of the dataset, i.e. being trained on more data. With the first 5000 reviews used, and a sequence length of 50 tokens, the runtime for 5 epochs with this model was 3841.8 minutes. Compared to the classical trans- former, which ran on a dataset that was 10x the size, and 20 epochs were completed in 43 minutes. The resulting data can be seen in table 4.6 and figure 4.6. 36 4. Evaluation Table 4.5: Hyperparameters for the variational quantum attention transformer Parameter Value Size of dataset 5000 Batch size 32 Sequence Length 50 Embedding Dimension 8 Feed Forward Dimension 8 Number Epochs 5 Number of Transformer blocks 1 Number of heads 1 Vocab size 20,000 Dropout 0.1 Figure 4.4: First 5000 reviews used, sequence length of 50 tokens, and ran for 5 epochs. The run-time was 3841.8 minutes 37 4. Evaluation Table 4.6: Accuracy and Loss results for the Variational Quantum Circuit Atten- tion Transformer Epoch Training Loss Training Accuracy (%) Testing Accuracy (%) Epoch Runtime (s) 1 0.693 50.62 52.42 46506.4 2 0.685 56.59 54.62 46300.4 3 0.681 57.29 57.12 45958.0 4 0.673 59.82 61.02 45852.4 5 0.661 62.19 62.62 45869.0 Total Runtime: 3841m, 47s The increase in runtime was not feasible to scale this model any higer to directly compare the best running classical transformer or attention-less transformer. The runtime comparison between the three models will be discussed in section 4.5. 4.4 Quantum Kernel Attention The kernel attention mechanism is currently running without training weights, but just as a single pairwise similarity measure. Even with this scaled down version of the attention mechanism, after 5 epochs, the training had taken 107.8 hours. This scale of runtime is not feasible for even a small model, and parallelization will need to be done to optimize runtime. However, it is interesting to note that while it took a long time, the model was able to produce accuracy results that were better than the variational circuit for this scale. The comparison can be seen in table 4.10. Table 4.7: Hyperparameters for the Quantum Kernel Attention Transformer Epoch Value Size of dataset 1000 Batch size 16 Sequence Length 50 Embedding Dimension 8 Feed Forward Dimension 256 Number Epochs 5 Number of Transformer blocks 1 Number of heads 1 Vocab size 20,000 Dropout 0.1 38 4. Evaluation Figure 4.5: The first 1000 reviews were used. A sequence length of 50 tokens, which was ran for 5 epochs. The runtime of this was 6469 minutes. Table 4.8: Accuracy and Loss results for the Quantum Kernel Attention Trans- former Epoch Training Loss Training Accuracy (%) Testing Accuracy (%) Epoch Runtime (s) 1 0.699 50.42 46.88 79449.3 2 0.692 51.42 55.86 77806.0 3 0.687 56.43 53.62 77725.3 4 0.686 57.60 59.10 75905.4 5 0.685 58.43 60.35 77305.2 Total Runtime: 6469m, 50s 4.5 Comparison This section will explore how the three models compare against each other. While the optimal performances had used a variety of different parameters, this section will explore how each of the models perform across an equally complex architecture using the same hyperparameters, numbeer of reviews, sequence length, number of epochs ran, etc. Table 4.9 shows the parameters used for comparison. 39 4. Evaluation Table 4.9: Parameters used to compare the three model architectures Epoch Value Size of dataset 1000 Batch size 32 Sequence Length 50 Embedding Dimension 8 Feed Forward Dimension 8 Number Epochs 5 Number of Transformer blocks 1 Number of heads 1 Vocab size 20,000 Dropout 0.1 Figure 4.6: First 1000 reviews used, sequence length of 50 tokens, and ran for 5 epochs. A comparison across all three models tested. The accuracies and runtimes of each model can be seen in the table 4.10 . It is important to note that the models were scaled to use the same hyperparameters during this comparison test. Looking at the comparison testing, the quantum kernel performed the best among the three models. However, it is worthwhile to note that we saw a similar dip in accuracy at around 5 epochs for the classical transformer when trained on the full size dataset and a review length 5x longer. It can only be assumed that if the classical transformer were to run for a longer time, that the accuracy would increase considering the performance seen in table 4.4. It is important to look at the runtimes for each of the models as well. Even across this small subset of data, the runtime 40 4. Evaluation Table 4.10: Comparison of different architectures based on accuracy and runtime after 5 epochs. Architecture Train Accuracy (%) Test Accuracy (%) Runtime (m, s) Classic 60.10 57.11 0m 29s VQC 58.26 57.61 778m 47s Quantum Kernel 58.43 60.35 6469m 50s for the classical transformer was 29 seconds including the preprocessing of the data which adds roughly 18s in runtime while the runtime for the two quantum models was 778 minutes and 6469 minutes for the variational quantum circuit and the quantum kernel attention, respectively. While the quantum kernel performed the best according to the accuracy score, the computational cost was extremely high. The most efficient model is evidently the classical model, which also has the highest training accuracy. Additional figures for the other configurations tested can be found in appendix B. In these runs, the size of the dataset used was varied from 500 reviews, 1000 reviews, 5000 reviews and 50000 reviews for the classic transformer as well as varying the sequence length from 30, 50 to 512. These were ran for 5 and 20 epochs. While not much difference is seen from 500 to 1000 reviews, when using a minimum sequence length of 30 and running for 5 epochs, the 50000 review model significantly out performed the others, suggesting that the size of the dataset has a considerable impact on the performance of the model. We can also see this trend when looking at the two hybrid models as well with performance increases from 500 to 1000 reviews and the same sequence length and number of epochs ran. A less significant increase in performance is seen when looking at the effect of the sequence length on performance. But the impact on the runtime naturally increases with increased complexity of the model. When investigating the results even further, we can see how the performance on the training set and testing set differs based on the size of the data used. Compared to the optimal classical model and optimal variational quantum circuit transformers, the quantum kernel performs poorly, comparing 60% accuracy compared to around 90% for the classical transformer. Even with the attention component deactivated, that model significantly out-performed the quantum kernel model and the vqc rel- ative to runtime, and training and accuracy scores. However, since the run-time for the hybrid transformer is much longer than that of a classical transformer, we needed to truncate the dataset and the length of the review. When we look at the sample texts from the testing set, it can be seen that the words that appear in the testing set, have not appeared in the training set, such that the model will have not been trained on the sentiment of that word so probability of getting that particular word correct, and how it relates to other words in the review would be simply a guess. 41 4. Evaluation 42 5 Conclusion This chapter will summarize the results and what was learned throughout the course of the thesis. I will also address the limitations and future directions of this work. As with many projects, the scope and scale of the project had to be adapted to accommodate complications that arose during programming. 5.1 Summary In summary, I focused on comparing three different transformer architectures and analyzed their performance using the IMDB sentiment analysis dataset. I began this thesis with a background on Natural Language Processing, and the different tasks that it can accomplish. Then to understand how the implementation of these NLP tasks is accomplished, I inserted a brief background on machine learning and quantum computing, including the qubit and circuit model, and how quantum gates can be used to construct quantum algorithms. We walked through the framework of the model and how the evaluation is to take place. After looking at the results, we can now address the research questions. Research Question 1 Can we implement a quantum kernel as the attention mech- anism for Transformer models? Yes, the attention is just a similarity measure between tokens in a high dimensional feature space. Because the kernel is not weighted at this point, the attention is just the similarity measure and that similarity value is then passed on to the feed forward network to then learn based on those similarities. I would say that this is also a future work, to add weights to the kernel values such that the circuit is parameterized and those parameters can be attended to throughout training. Research Question 2 Can the quantum kernel provide a computational speedup compared to using a classic self attention of the transformer model? Definitely not at this point, at least not with this implementation. Research Question 3 Does the training of a large language model using a quantum attention result in higher accuracy compared to a classic attention model? At this point, the classical circuit is the best performing model Research Question 4 How does this implementation of a hybrid transformer with quantum kernel attention compare to previous work done using a variational quan- tum circuit as the attention? Compared to previous work, the performance is similar with regards to accuracy. However, the runtime of the two differ by about a factor of 5 for even the simplest of models. 43 5. Conclusion 5.2 Limitations The question regarding why kernels are not used more frequently in today’s architec- tures can be answered for the same reason why the performance of the architecture did not perform as well as the other transformer architectures. The pairwise com- putation of the kernel limits how scalable the model is. With the current support from the Quantum Machine Learning backend Pennylane, there is not a good way to parallelize the computation of the pairwise attention. As such, the computation is very slow and limits the usability of the framework due to time constraints. As I was writing this thesis, TorchQuantum library was introduced out of an MIT lab, which enables batching support for quantum circuits. However, the implementation would have required to learn a new library and the time did not permit this. Another limitation of this work, which I think was also a valuable learning opportunity for me was how to efficiently write scalable code for deep tensor networks. Due to my inexperience with using deep learning frameworks, I think that I had a steep learning curve to become comfortable with writing efficient code. I think ultimately, there might have been a better way to implement the kernel for the quantum attention which would have scaled better with the model, but for the purposes of the thesis, I will leave that for future work. 5.3 Future work Because there was not an exhaustive hyper parameter search, I also think that this is a future direction. I began exploring the effects of the learning rate and how that effects the training. Also would like to explore to add a trainable parameterized layer in the quantum circuit for the kernel. To be able to train a quantum kernel, would create a frame work more similar to the classical attention that works so well. Without a trainable parameter, the attention in the quantum feaure space is just a scalar value that does not change throughout the course of the training. 44 Bibliography [1] Francesco Bernardini, Abhijit Chakraborty, and Carlos Ordóñez. Quan- tum computing with trapped ions: a beginner’s guide. arXiv preprint arXiv:2303.16358, 2023. [2] Avi Chawla, Nidhi Mulay, Vikas Bishnoi, Gaurav Dhama, and Anil Kumar Singh. A comparative study of transformers on word sense disambiguation. CoRR, abs/2111.15417, 2021. [3] El Amine Cherrat, Iordanis Kerenidis, Natansh Mathur, Jonas Landman, Mar- tin Strahm, and Yun Yvonna Li. Quantum vision transformers. arXiv preprint arXiv:2209.08167, 2022. [4] Andrew J. Daley, Immanuel Bloch, Christian Kokail, et al. Practical quantum advantage in quantum simulation. Nature, 607(7919):667–676, 2022. [5] David P. DiVincenzo. The physical implementation of quantum computation. Fortschritte der Physik, 48(9–11):771–783, September 2000. [6] Giulia Ferrini, Anton Frisk-Kockum, Pontus Vikstål, and Ariadna Soro. Quan- tum computing lecture notes, 2023. [7] Fulvio Flamini, Nicolò Spagnolo, and Fabio Sciarrino. Photonic quantum in- formation processing: a review. Reports on Progress in Physics, 82(1):016001, November 2018. [8] Vojtvech Havlicek, Antonio D. Corcoles, Kristan Temme, Aram W. Harrow, Abhinav Kandala, Jerry M. Chow, and Jay M. Gambetta. Supervised learn- ing with quantum-enhanced feature spaces. Nature, 567(7747):209–212, March 2019. [9] Thomas Hofmann, Bernhard Schölkopf, and Alexander J. Smola. Kernel meth- ods in machine learning. The Annals of Statistics, 36(3), June 2008. [10] Thomas Hubregtsen, David Wierichs, Elies Gil-Fuster, Peter-Jan H. S. Derks, Paul K. Faehrmann, and Johannes Jakob Meyer. Training quantum embedding kernels on near-term quantum computers. Physical Review A, 106(4):042431, October 2022. [11] Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759, 2016. [12] P. Krantz, M. Kjaergaard, F. Yan, T. P. Orlando, S. Gustavsson, and W. D. Oliver. A quantum engineer’s guide to superconducting qubits. Applied Physics Reviews, 6(2):021318, 2019. [13] Guangxi Li, Xuanqiang Zhao, and Xin Wang. Quantum self-attention neu- ral networks for text classification. Science China Information Sciences, 67(4):142501, 2024. 45 Bibliography [14] Zhuang Liu, Zhiqiu Xu, Joseph Jin, Zhiqiang Shen, and Trevor Darrell. Dropout reduces underfitting. In International Conference on Machine Learning, pages 22233–22248. PMLR, 2023. [15] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Dekang Lin, Yuji Matsumoto, and Rada Mihalcea, editors, Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. [16] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013. [17] Tomas Mikolov, Armand Joulin, Sumit Chopra, Michael Mathieu, and Marc’Aurelio Ranzato. Learning longer memory in recurrent neural networks. arXiv preprint arXiv:1412.7753, 2014. [18] Michael A. Nielsen and Isaac L. Chuang. Quantum Computation and Quan- tum Information: 10th Anniversary Edition. Cambridge University Press, New York, NY, USA, 10th edition, 2011. [19] OpenAI. Chatgpt release notes. https://help.openai.com/en/articles/ 6825453-chatgpt-release-notes#h_e18c6a1b3a, 2023. [20] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Alessandro Moschitti, Bo Pang, and Walter Daelemans, editors, Proceedings of the 2014 Conference on Empirical Meth- ods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar, October 2014. Association for Computational Linguistics. [21] John Preskill. Quantum computing in the nisq era and beyond. Quantum, 2:79, Aug 2018. [22] Patrick Rebentrost, Masoud Mohseni, and Seth Lloyd. Quantum support vector machine for big data classification. Physical Review Letters, 113(13):130503, September 2014. [23] Bernhard Schölkopf and Alexander J. Smola. A short introduction to learning with kernels. In Shahar Mendelson and Alexander J. Smola, editors, Advanced Lectures on Machine Learning, volume 2600 of Lecture Notes in Computer Sci- ence, pages 41–64. Springer, Berlin, Heidelberg, 2003. [24] Maria Schuld. Kernel-based training of quantum models with scikit-learn. https://pennylane.ai/qml/demos/tutorial_kernel_based_training, 02 2021. [25] Maria Schuld. Supervised quantum machine learning models are kernel meth- ods. arXiv preprint arXiv:2101.11020, 2021. [26] Maria Schuld and Francesco Petruccione. Machine Learning with Quantum Computers. Quantum Science and Technology. Springer, 2nd edition, 2021. [27] Riccardo Di Sipio, Jia-Hong Huang, Samuel Yen-Chi Chen, Stefano Mangini, and Marcel Worring. The dawn of quantum natural language processing. CoRR, abs/2110.06510, 2021. [28] Princeton University. About wordnet. https://wordnet.princeton.edu/, 2010. 46 https://help.openai.com/en/articles/6825453-chatgpt-release-notes#h_e18c6a1b3a https://help.openai.com/en/articles/6825453-chatgpt-release-notes#h_e18c6a1b3a https://pennylane.ai/qml/demos/tutorial_kernel_based_training https://wordnet.princeton.edu/ Bibliography [29] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. CoRR, abs/1706.03762, 2017. [30] Manuela Weigold, Johanna Barzen, Frank Leymann, and Marie Salm. Ex- panding data encoding patterns for quantum algorithms. In 2021 IEEE 18th International Conference on Software Architecture Companion (ICSA-C), pages 95–101, 2021. [31] Manuela Weigold, Johanna Barzen, Frank Leymann, and Marie Salm. Data encoding patterns for quantum computing. ACM Computing Surveys, 55(9):1– 37, 2022. [32] Papers with Code. State-of-the-art sentiment analysis on imdb. https:// paperswithcode.com/sota/sentiment-analysis-on-imdb, 2023. [33] Jingfeng Yang, Hongye Jin, Ruixiang Tang, Xiaotian Han, Qizhang Feng, Haoming Jiang, Shaochen Zhong, Bing Yin, and Xia Hu. Harnessing the power of llms in practice: A survey on chatgpt and beyond. ACM Transactions on Knowledge Discovery from Data, 18(6