Productivity Applications of Language Intelligence Modeling in Domain Specific Research
Master's Thesis in Systems, Control and Mechatronics
Sanam Molaee

DEPARTMENT OF ELECTRICAL ENGINEERING
CHALMERS UNIVERSITY OF TECHNOLOGY
Gothenburg, Sweden 2024
www.chalmers.se

Productivity Applications of Language Intelligence Modeling in Domain Specific Research
Sanam Molaee
© Sanam Molaee, 2024.

Supervisor: Dr. Jawwad Ahmed, Autoliv Research
Examiner: Prof. Martin Fabian, Department of Electrical Engineering, Chalmers

Degree project report 2024
Department of Electrical Engineering
Chalmers University of Technology
SE-412 96 Gothenburg
Sweden
Telephone +46 31 772 1000

Cover: Retrieval Augmented Generation Pipeline
Typeset in LaTeX, template by Kyriaki Antoniadou-Plytaria
Gothenburg, Sweden 2024

Abstract
The popularity of large language models (LLMs) has led to their widespread adoption in various domains for diverse tasks. These models, characterized by their vast number of parameters, show advanced capabilities such as solving complex problems and multi-step reasoning. Despite potential risks, they are seen as promising for enhancing productivity and workflows in numerous areas. The aim of this project is to integrate LLMs into research workflows in the automotive safety domain to streamline processes. It focuses on assessing and selecting a suitable open-source LLM for this purpose, as well as implementing capabilities such as semantic search and automatic insight generation from the given data. For complex question answering, a retrieval augmented generation (RAG) pipeline is implemented, which is then shown to be a viable approach to exploiting the capabilities of large language models. The key deliverable is a proof-of-concept demonstrating the practical application of LLMs in processing and analyzing domain-specific data. This project shows that an open-source LLM can play a significant role in enhancing the productivity of domain-specific research. Moreover, comparing its performance with GPT-3.5 Turbo, a proprietary model with higher costs, shows that the open-source model provides competitive performance.

Keywords: Natural language processing, Generative AI, Large language models, Retrieval augmented generation

Acknowledgements
I would like to thank Dr. Jawwad Ahmed, my industrial supervisor at Autoliv, whose insights and guidance have been invaluable to me throughout this project. I would also like to thank Prof. Martin Fabian, my academic supervisor and examiner at Chalmers, for his helpful comments, patience and support during this work.
Sanam Molaee, Gothenburg, 2024

List of Acronyms
Below is the list of acronyms that have been used throughout this thesis, listed in alphabetical order:
BERT Bidirectional Encoder Representations from Transformers
GPT Generative Pre-trained Transformer
GUI Graphical User Interface
GT Ground truth
LLM Large Language Model
NLP Natural language processing
PLM Pre-trained language models
RAG Retrieval Augmented Generation

Contents
List of Acronyms
List of Figures
List of Tables
1 Introduction
1.1 Background
1.2 Motivation
1.3 Objectives
1.4 Research questions
2 Theory
2.1 Natural language processing
2.2 Transformers
2.3 Text embeddings
2.4 Large language models
2.4.1 Types
2.4.2 Parameters
2.5 Techniques for using LLMs
2.5.1 Prompt engineering
2.5.2 Fine-tuning
2.5.3 RAG
2.6 Evaluation
2.6.1 Faithfulness
2.6.2 Answer relevancy
2.6.3 Context precision
2.6.4 Context recall
2.6.5 Answer semantic similarity
2.6.6 Answer correctness
3 Method
3.1 Open-source LLM
3.2 Embedding model
3.3 RAG pipeline
3.3.1 Data indexing
3.3.2 Retrieval and Generation
3.4 Evaluation
3.4.1 Proprietary LLM
3.4.2 Synthetic dataset generation
3.4.3 RAGAS evaluator
3.5 Graphical User Interface
4 Results
4.1 Related questions
4.1.1 Results with respect to evolution type
4.1.2 Overall results for each metric
4.2 Unrelated questions
5 Conclusion
6 Future Work
Bibliography

List of Figures
2.1 Transformer architecture
3.1 RAG pipeline
3.2 Evaluation pipeline
3.3 GUI
4.1 Faithfulness
4.2 Answer relevancy
4.3 Context precision
4.4 Context recall
4.5 Answer semantic similarity
4.6 Answer correctness
4.7 Overall results for the related questions
4.8 Overall results for the unrelated questions

List of Tables
3.1 Response generation time for different temperature, Top-p, and Top-k settings

1 Introduction
This project has been done in collaboration with Autoliv to explore the integration of LLMs into research workflows to enhance cognitive capabilities and streamline processes. It focuses on assessing and selecting a suitable LLM for enterprise settings and building a research system with capabilities such as semantic search and automatic insight generation from Autoliv's in-house dataset containing publications in the field of automotive safety. The key deliverable of this research is a proof-of-concept demonstrating the practical application of LLMs in processing and analyzing domain-specific data.

1.1 Background
The popularity of LLMs has rapidly increased, leading to their widespread adoption across various domains for different tasks. These models, characterized by their vast number of parameters, exhibit advanced capabilities such as solving complex problems, performing multi-step reasoning, and generating human-like text. Their proficiency also extends to tasks like text summarization, translation, sentiment analysis, and conversational agents, making them invaluable tools in sectors such as healthcare, finance, education, and research. Despite the potential risks associated with their use, including ethical concerns, biases, and the possibility of generating misleading information, LLMs are increasingly seen as transformative for enhancing productivity and workflows [1]. Their ability to automate repetitive tasks, provide insights from large datasets, and assist in decision-making processes can significantly boost efficiency and accuracy. For example, in educational settings, LLMs can provide personalized learning experiences and automate grading [2]. Overall, the adoption of LLMs marks a significant step towards leveraging artificial intelligence to improve productivity, drive innovation, and tackle complex challenges in numerous domains.

1.2 Motivation
The motivation for this thesis comes from the rapid growth and capabilities of LLMs in transforming research methodologies. With the increasing complexity and volume of data in research, there is a growing need for tools that can enhance efficiency and productivity. Moreover, most of the available research data is unstructured, spread across many sources, vast in volume, and sensitive in nature. This motivates the need for
an advanced NLP-based solution that can not only provide advanced knowledge extraction capabilities, but also act as a digital expert that can rapidly respond to complex technical questions in the specific area, within a secure enterprise environment. This thesis is driven by the potential of LLMs to revolutionize data analysis, information retrieval, and insight generation in domain-specific research, addressing current limitations.

1.3 Objectives
The main goal of the project is to explore current open-source LLMs for enhancing research workflows. It focuses on assessing and choosing suitable LLMs for enterprise use, based on criteria including complexity, capabilities, and performance. The project concentrates on in-context learning and history maintenance to utilize the zero/few-shot learning abilities of LLMs for domain-specific tasks.

1.4 Research questions
Three key questions are discussed in this research.
1. Can current state-of-the-art LLM technologies be used to improve the research workflows in the automotive safety domain by enhancing productivity or quality of work?
2. Is exploiting zero/few-shot capabilities of LLMs, for example through RAG, a viable approach?
3. Under the RAG scenario, can open-source LLMs provide competitive performance compared to closed-source/proprietary LLM models?

2 Theory
This chapter discusses the theoretical foundations of this project, including key areas such as NLP, the architecture of transformers, the importance of text embeddings, LLMs, methods for using them, and the evaluation metrics.

2.1 Natural language processing
NLP involves techniques designed to allow computers to interpret, process, and produce human language in ways that are both meaningful and practical [3]. This field combines elements of computational linguistics, which provides structured models for understanding language, with various machine learning and deep learning approaches. These models are developed to simulate human language understanding capabilities, enabling computers to interact with humans through text or spoken words.

2.2 Transformers
Transformers are models specifically engineered for handling sequential data, such as text, for NLP tasks [4]. In transformers, data processing is facilitated entirely through layers known as attention layers. These layers are designed to dynamically weigh the relevance of different parts of the input data. The transformer architecture generally consists of an encoder and a decoder. The encoder processes the input data into a continuous representation that holds all the learned insights about the input. The decoder then uses this representation, along with previous output elements, to generate the output sequence. Each layer in both the encoder and decoder contains a series of self-attention and feed-forward neural network layers, which help in refining the transformations and representations at each step. The architecture of a transformer can be seen in Figure 2.1.

Figure 2.1: Transformer architecture [4]

2.3 Text embeddings
Text embeddings are an important element in NLP tasks. These embeddings transform text into low-dimensional vector forms that encapsulate the semantic relationships of words or phrases within text [5]. This transformation is achieved through vector operations, which mathematically analyze and map the semantic similarities and differences between texts into a spatial representation. Cosine similarity is a key metric in this context.
It measures the cosine of the angle between two vectors, providing a value between -1 and 1. These values indicate how similar the two vectors are, with 1 meaning the two vectors point in the same direction and have maximum similarity, 0 meaning they are orthogonal with no similarity, and -1 meaning they point in exactly opposite directions with maximum dissimilarity [6]. Mathematically, the cosine similarity between two vectors A and B is calculated as:

\[ \text{cosine similarity} = \frac{A \cdot B}{\|A\|\,\|B\|} \tag{2.1} \]

This measure is particularly useful in semantic search and information retrieval, where it helps in finding documents or text segments that are contextually similar to a given query. By focusing on the direction rather than the magnitude of the vectors, cosine similarity effectively captures the semantic relevance, making it a powerful tool for comparing text embeddings.

2.4 Large language models
LLMs represent a subset of NLP models, pre-trained on vast amounts of text data from diverse sources, which equips them with a comprehensive grasp of human language. Based on transformer architectures, these models utilize attention mechanisms to enhance their predictive and generative text capabilities. LLMs are trained using a process called unsupervised learning. This involves feeding the model massive amounts of text data and having the model learn the patterns and relationships between words and phrases in the text. They are known for their ability to generate coherent and contextually appropriate text responses from given prompts. Their major advantages include the ability to adapt to various language tasks without direct task-specific training, often demonstrating human-like writing abilities [3].

2.4.1 Types
One prominent model based on the transformer architecture is BERT [7]. BERT focuses on understanding the context of a word based on its surrounding words in both directions (left and right). This bidirectional approach allows for a more comprehensive grasp of linguistic subtleties, making BERT particularly effective for tasks such as question answering and sentiment analysis.
Another significant model in the LLM landscape is the GPT series, including GPT-2 and GPT-3 [8, 9]. GPT models are autoregressive, meaning they generate text by predicting the next word in a sequence based on the words that came before it. These models are pre-trained on a large corpus of text and then fine-tuned for specific tasks. GPT models have demonstrated remarkable capabilities in text generation, conversation, and even creative writing, thanks to their extensive pre-training on diverse datasets.
T5 (Text-To-Text Transfer Transformer) [10] represents another innovative approach in the realm of LLMs. T5 converts all NLP tasks into a text-to-text format, enabling a unified framework for various tasks such as translation, summarization, and question answering. This text-to-text paradigm simplifies the model architecture and makes it versatile in handling different NLP tasks with a single model.

2.4.2 Parameters
LLMs have various parameters, but three of the most essential ones are as follows.
1. Top K: This parameter limits the model's predictions to the k most probable tokens at each generation step. By assigning a value to k, the model is guided to focus on the tokens with the highest probability.
2. Top P: This parameter controls the cumulative probability of the candidate tokens. Tokens are considered until their cumulative probability exceeds the threshold P.
3.
Temperature: This is a scaling factor, applied within the model to shape the probability distribution of the next token. A higher temperature adds more randomness to the output.

2.5 Techniques for using LLMs
There are three main techniques for adapting LLMs to a specific task.

2.5.1 Prompt engineering
Prompt engineering is the process of creating and using specific instructions, known as prompts, to guide the output of pre-trained LLMs [11]. This technique is significant because it modifies model behavior without altering the core model parameters, enhancing the flexibility and effectiveness of these models in various applications. The essence of prompt engineering lies in its capacity to integrate pre-trained models seamlessly into diverse tasks through well-crafted instructions. This approach contrasts with traditional methods that often necessitate extensive model retraining or fine-tuning, presenting a more efficient alternative for adapting models to new tasks. There are various prompt engineering techniques, but the focus in this project is on two of them:
1. Zero-shot prompting: The model leverages its general knowledge to tackle new tasks without specific training, based on the prompt description alone.
2. Few-shot prompting: This method enhances the model's understanding of a task by providing a handful of examples, improving its performance on similar tasks.

2.5.2 Fine-tuning
Fine-tuning involves taking a pre-trained LLM and adapting it to a specific task or domain by further training it on a smaller, task-specific dataset. This additional training process adjusts the model's parameters, allowing it to perform better on the new task compared to its performance as a general-purpose language model [12]. Although highly effective in generating precise outputs, this approach comes with significant initial costs, such as computational resources [13]. Also, fine-tuning is usually a supervised learning process, which necessitates having a labeled dataset.

2.5.3 RAG
RAG is an approach designed to enhance the performance of language models on knowledge-intensive tasks by integrating retrieval mechanisms with generative models [14]. The key innovation of RAG is to augment the generation process with relevant information fetched from a large external corpus, enabling the model to produce more accurate and contextually rich responses. The main components of a RAG pipeline are [15]:
1. Retrieval: This component retrieves relevant documents based on input queries using dense vector representations, which facilitates efficient and accurate similarity computations.
2. Generation: In this step the generative model conditions on the input query and the retrieved documents to generate responses. This allows the model to leverage up-to-date external knowledge, enhancing contextual relevance and factual accuracy.
Different types of RAG include:
1. Naive RAG: This combines retrieval and generation with minimal interaction. The generative model directly uses retrieved documents as additional context.
2. Advanced RAG: This employs techniques like fine-tuning of the retrieval mechanism and alignment between retrieval and generation to enhance the quality of responses, ensuring better relevance and coherence.
3. Modular RAG: This separates RAG into distinct, interchangeable modules for retrieval, generation, and augmentation, allowing for more targeted improvements and flexibility.
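Several of the evaluation metrics introduced in the next section reuse the cosine similarity of Eq. (2.1) on embedding vectors. The short Python sketch below only illustrates that computation with NumPy; it is not part of the implemented pipeline, and the example vectors are made up for illustration.

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors, as in Eq. (2.1)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "embeddings"; real text embeddings have hundreds of dimensions.
query = np.array([0.9, 0.1, 0.3])
doc_a = np.array([0.8, 0.2, 0.4])    # similar direction -> score close to 1
doc_b = np.array([-0.7, 0.5, -0.1])  # roughly opposite direction -> negative score

print(cosine_similarity(query, doc_a))
print(cosine_similarity(query, doc_b))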
2.6 Evaluation
The performance of a RAG pipeline can be evaluated by various metrics [16]. In this research, the focus has been on six metrics discussed in this section.

2.6.1 Faithfulness
The faithfulness metric evaluates the factual consistency of the generated answer compared to the provided context, calculating it from the answer and the retrieved context. The results are scaled to a range of [0,1], with higher scores indicating better performance. An answer is considered faithful if all the statements made within it can be derived from the context provided. To determine this, a list of claims from the generated answer is first identified. Each of these claims is then cross-referenced against the given context to verify if it can indeed be inferred from it. The faithfulness score is calculated as:

\[ \text{Faithfulness score} = \frac{|\text{Number of claims in the generated answer that can be inferred from the given context}|}{|\text{Total number of claims in the generated answer}|} \tag{2.2} \]

2.6.2 Answer relevancy
The answer relevancy metric measures the relevance of the generated answer to the specified prompt. It assigns lower scores to answers that are either incomplete or contain excessive information, while higher scores reflect greater relevance. This metric is calculated using the question, its context, and the answer. Answer relevancy is determined by calculating the average cosine similarity between the original question and a set of artificial questions that have been reverse-engineered from the answer. The answer relevancy score is calculated as:

\[ \text{Answer relevancy} = \frac{1}{N} \sum_{i=1}^{N} \cos(E_{g_i}, E_o) = \frac{1}{N} \sum_{i=1}^{N} \frac{E_{g_i} \cdot E_o}{\|E_{g_i}\|\,\|E_o\|}, \tag{2.3} \]

where \(E_{g_i}\) is the embedding of the generated question i, \(E_o\) is the embedding of the original question, and N is the number of generated questions. It is worth mentioning that although the score typically ranges between 0 and 1, in practice this range is not mathematically guaranteed. This is because, as mentioned earlier, the nature of cosine similarity allows for values ranging from -1 to 1.

2.6.3 Context precision
Context precision is a metric designed to assess whether all relevant ground truth items in the contexts are ranked appropriately high. Ideally, all relevant chunks should appear in the topmost ranks. This metric is calculated using the question, ground truth data, and the contexts, with values spanning from 0 to 1. Higher scores indicate greater precision. Context precision is calculated as:

\[ \text{Context Precision@K} = \frac{\sum_{k=1}^{K} \left(\text{Precision@}k \times v_k\right)}{\text{Total number of relevant items in the top } K \text{ results}} \tag{2.4} \]

\[ \text{Precision@}k = \frac{\text{true positives@}k}{\text{true positives@}k + \text{false positives@}k} \tag{2.5} \]

where K is the total number of chunks in the contexts and \(v_k \in \{0, 1\}\) is the relevance indicator at rank k.

2.6.4 Context recall
Context recall evaluates how closely the retrieved context matches the annotated answer, considered as the ground truth. This metric is calculated using the ground truth and the retrieved context, with values ranging from 0 to 1, where higher scores indicate better performance. To assess context recall from the ground truth answer, each sentence within the ground truth answer is examined to see if it corresponds to the retrieved context. Ideally, every sentence in the ground truth answer would be attributed to the retrieved context.
The context recall score can be calculated as:

\[ \text{Context recall} = \frac{|\text{GT sentences that can be attributed to context}|}{|\text{Number of sentences in GT}|} \tag{2.6} \]

2.6.5 Answer semantic similarity
Answer semantic similarity enables evaluating how semantically similar the generated answer is to the ground truth. This metric is derived from both the ground truth and the answer, with scores ranging from 0 to 1. Higher scores indicate a stronger semantic alignment between the generated answer and the ground truth. Assessing semantic similarity provides important information about the quality of the generated response. This process employs a cross-encoder model to determine the semantic similarity score. The process is done through three steps:
1. Vectorize the ground truth answer using the specified embedding model.
2. Vectorize the generated answer using the same embedding model.
3. Compute the cosine similarity between the two vectors.

2.6.6 Answer correctness
The answer correctness score measures the precision of the generated answer relative to the ground truth. This analysis depends on comparing the ground truth with the answer, and the scoring system ranges from 0 to 1. A higher score reflects a more accurate match with the ground truth, indicating greater correctness. Answer correctness involves two essential factors: the semantic similarity and the factual accuracy between the generated answer and the ground truth. These factors are combined using a weighted formula to calculate the answer correctness score. Factual correctness assesses the factual overlap between the generated answer and the ground truth answer. This assessment is based on three concepts:
1. TP (True Positive): Facts that appear in both the ground truth and the generated answer
2. FP (False Positive): Facts that appear in the generated answer but are absent in the ground truth
3. FN (False Negative): Facts that are missing in the generated answer but appear in the ground truth
An F1 score is then used to calculate the factual correctness as:

\[ \text{F1 Score} = \frac{|TP|}{|TP| + 0.5 \times (|FP| + |FN|)} \tag{2.7} \]

Then, a weighted average of the factual correctness and the semantic similarity is taken to obtain the answer correctness score.

3 Method
This chapter outlines the comprehensive methodology employed in this study. The dataset consisted of research articles and technical documents in the automotive safety area from Autoliv, all in PDF format. The methodology encompasses several key stages, including choosing a suitable open-source LLM and embedding model, building a RAG pipeline, and evaluating its performance. Additionally, a user-friendly GUI was developed to facilitate interaction with the models, ensuring practical applicability and ease of use. It is worth mentioning that the source code was written in Python.

3.1 Open-source LLM
The Zephyr-7B-β model [17] was selected as the LLM for this project, and Hugging Face was used for accessing this model. Part of the Zephyr series of language models, Zephyr-7B-β represents a specialized iteration, having been fine-tuned from the pre-trained generative text model Mistral-7B-v0.1 [18]. This model consists of 7 billion parameters and has undergone fine-tuning using a publicly accessible synthetic dataset that includes 1.4 million dialogues generated by GPT-3.5 Turbo covering a variety of topics.
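As a rough illustration of how such a model can be loaded and prompted through Hugging Face, the sketch below uses the transformers library. It is an illustrative reconstruction rather than the project code: the repository name is the public Zephyr-7B-β checkpoint as published by Hugging Face, the prompt is made up, and the sampling values correspond to those selected later in Table 3.1.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed public Hugging Face repository for Zephyr-7B-beta.
model_id = "HuggingFaceH4/zephyr-7b-beta"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "What is the purpose of a seatbelt pretensioner?"  # made-up example query
inputs = tokenizer(prompt, return_tensors="pt")

# Sampling parameters match the configuration investigated in Chapter 3.
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    top_k=50,
    max_new_tokens=200,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))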
The reason for choosing this model was its competitive performance on MT-Bench [19] and AlpacaEval [20]. This model, with only 7 billion parameters, has the best performance compared to other open-source models of larger sizes, up to 70 billion parameters [17]. This small size makes it a low-complexity model that can provide good performance while running on low- to mid-range devices. Different combinations of temperature, Top-p, and Top-k, based on previous findings [21], were evaluated to determine the optimal values for efficient response generation. Table 3.1 illustrates the time taken by the language model to generate responses under varying parameter settings. The results indicate that the optimal configuration is a temperature of 0.7, a Top-p of 0.95, and a Top-k of 50. It is important to note that this comparison was conducted on the same dataset, yielding identical responses, thereby ensuring that the only varying outcome was the response time.

Temperature  Top-p  Top-k  Time [s]
0.6          0.95   50     720
0.7          0.95   40     510
0.7          0.9    50     490
0.7          0.95   50     360
0.7          0.95   60     1670
0.8          0.95   50     1800

Table 3.1: Response generation time for different temperature, Top-p, and Top-k settings

3.2 Embedding model
BAAI/bge-small-en-v1.5 [22] has been used as the text embedding model, and Hugging Face was used for accessing this model. The decision for this model came from its evaluation via MTEB [23], a widely recognized benchmark in the field. With 33 million parameters and 384 embedding dimensions, this model strikes a balance between resource efficiency and performance. Notably, it consumes less memory compared to other models, which is a crucial feature for our study. Despite this, it achieves a high MTEB score, indicating its effectiveness and suitability for our research objectives.

3.3 RAG pipeline
The RAG pipeline was developed in two main stages, using the LlamaIndex framework [24]. An illustration of the pipeline is shown in Figure 3.1. As can be seen, the data is first organized and indexed, making it ready for search queries. When a user submits a query, the index narrows down the most pertinent data to use. This selected data, along with the user's query and a specific prompt, is then sent to the LLM, which generates a response based on this context.

Figure 3.1: RAG pipeline

3.3.1 Data indexing
For loading the dataset, the SimpleDirectoryReader function, which is part of the LlamaIndex library, is used. This function breaks each document into its individual pages and generates metadata for each page, containing a dictionary of annotations that can be appended to the text. This brings structure to each unstructured text file. Afterwards, the documents were parsed into nodes and split into chunks with a size of 300 characters. Small chunk sizes enable the model to perform better in terms of semantic search, but as the chunk size decreases, the memory usage increases significantly. Therefore, a chunk size of 300 was chosen as a viable option. In the next step, the documents were indexed. VectorStoreIndex was used to create vector embeddings for each node via the BAAI/bge-base-en-v1.5 text embedding model. The indexed data was then stored on local disk in JSON format. These files include index IDs, containing the metadata for the indexes.

3.3.2 Retrieval and Generation
For this part, as_chat_engine was used for prompting the model and enabling it to retrieve the relevant information for the query from the stored vector indices.
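A condensed sketch of how this indexing and chat-engine setup can look with LlamaIndex is given below. It is an assumed reconstruction, not the project source: the module paths follow a recent LlamaIndex release (and the separate llama-index-embeddings-huggingface package) and may differ between versions, and the directory names are made up.

from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Use the BGE embedding model from Hugging Face for all vector embeddings.
# Settings.llm would be pointed at the locally loaded Zephyr model in the same way.
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Load the PDF documents; each page becomes a document with its own metadata.
documents = SimpleDirectoryReader("data").load_data()

# Parse the documents into nodes using small chunks.
parser = SentenceSplitter(chunk_size=300, chunk_overlap=0)
nodes = parser.get_nodes_from_documents(documents)

# Embed every node and persist the resulting index to local disk as JSON.
index = VectorStoreIndex(nodes)
index.storage_context.persist(persist_dir="storage")

# Expose the index as a chat engine; the thesis retrieves the 20 most related
# contexts per query.
chat_engine = index.as_chat_engine(similarity_top_k=20)
response = chat_engine.chat("Which injury criteria are discussed in the studies?")
print(response)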
The model was given a prompt as shown in Listing 3.1.

context_prompt = (
    "You are an expert Q&A system that is trusted around the world.\n"
    "Always answer the query using the provided context information, "
    "and not prior knowledge.\n"
    "Some rules to follow:\n"
    "1. Write the name of the study you used to get your response.\n"
    "2. Avoid statements like 'Based on the context, ...' or "
    "'The context information ...' or anything along those lines.\n"
    "Here is some context that may be relevant:\n"
    "-----\n"
    "{node_context}\n"
    "-----\n"
    "Please write a response to the following question, using the above context:\n"
    "{query_str}\n"
)

Listing 3.1: System prompt given for the query

Via the embedding model, the 20 most related contexts to the query were retrieved, enabling the model to perform semantic search. In this model, the similarity between the context and the query was evaluated using average cosine similarity. To integrate the capability for automatic insight generation into the model, a combination of a query and word chunks derived from the split documents was employed. Listing 3.2 demonstrates how the initial chunk of the split document, nodes[0].get_content(), has been incorporated into the prompt.

response = chat_engine.chat(
    "Write a short summary and result of the document provided below:\n"
    + nodes[0].get_content()
)

Listing 3.2: Query for the text summarization (automatic insight generation)

Another feature implemented in the model is history maintenance. In order to do so, the SimpleChatStore and ChatMemoryBuffer functions were used to enable saving chats and loading previous ones. SimpleChatStore takes part in retrieving and formatting the history, creating a prompt that includes both the history and the new input, and then using the LLM to generate a response based on this combined context. This ensures the model can provide contextually relevant responses, maintaining the flow of the conversation. ChatMemoryBuffer allocates an amount of memory for saving the user input and the model's responses. This is done via a token limit, set to 2000, which indicates the amount of text saved in memory. If the amount of saved text exceeds this limit, the oldest messages are replaced by the most recent ones.

3.4 Evaluation
For the evaluation of the open-source model and the RAG pipeline, the RAGAS framework [16] was used. This process was done through multiple steps. A demonstration of the pipeline is shown in Figure 3.2.

Figure 3.2: Evaluation pipeline

3.4.1 Proprietary LLM
In order to get a sense of the performance of the open-source LLM compared to a proprietary one, GPT-3.5 Turbo was used as the proprietary model. Developed by OpenAI, GPT-3.5 Turbo has 175 billion parameters and is known for its ability to perform few-shot learning tasks [25]. The embedding model for it was text-embedding-ada-002, which is the default embedding model for OpenAI LLMs. Microsoft Azure AI Machine Learning Studio was used for accessing these models through the API. The temperature was set to 0.70 and Top-p was set to 0.95, which were the default values for this model in the platform.

3.4.2 Synthetic dataset generation
In order to have a ground truth to compare with the results generated by the developed model, a synthetic dataset was generated, consisting of 100 questions from the in-house documents along with their ground truth answers and relevant contexts.
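A rough sketch of how such a test set can be produced with RAGAS is shown below, using the three evolution types and the question distribution described later in this section. The class and argument names are assumptions based on the RAGAS 0.1 series and may differ in other versions, so this is an illustrative reconstruction rather than the project code.

from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context

# GPT-3.5 Turbo acts as both generator and critic LLM (an OpenAI/Azure key is
# assumed to be configured in the environment).
generator = TestsetGenerator.with_openai()

# documents: the LlamaIndex documents loaded from the in-house PDF collection.
testset = generator.generate_with_llamaindex_docs(
    documents,
    test_size=100,
    distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
)
test_df = testset.to_pandas()  # columns include question, contexts, ground_truth, evolution_type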
Generating the dataset synthetically avoided spending significant time on manually creating the answers to each question, as well as the possible errors that a human-made answer could contain. It is worth mentioning that the original test set consisted of 400 questions, but the RAGAS evaluator was only able to generate ground truths for 100 of them. For this dataset generation, GPT-3.5 Turbo was used both as the generator LLM for generating the synthetic dataset and the critic LLM for evaluating the results of the RAG pipeline. To generate questions ranging from easy to hard from the dataset, three evolution types were used as follows [26].
1. Simple: These questions were simple and basic.
2. Reasoning: These questions were written so as to require the model to reason in order to give the correct answer.
3. Multi-context: These questions were written so as to necessitate retrieving information from multiple related sections or chunks to give a correct answer.
The distribution of the generated questions was 50 % simple questions, 25 % reasoning questions, and 25 % multi-context questions. Data cleaning was done afterwards to drop unnecessary features, such as "evolution type", "metadata", etc., and another feature called "answer" was added to correct the format of the input for the RAGAS evaluator. It is worth mentioning that manual inspection was performed to check the quality of the generated synthetic dataset. Also, five questions and answers unrelated to the in-house dataset, along with their ground truths, were added manually in order to see the performance of the model in answering questions that were out of context.

3.4.3 RAGAS evaluator
The generated questions were used as input for both Zephyr-7B-β (open-source) and GPT-3.5 Turbo (proprietary). The answers generated by these two models, along with the ground truth and context produced in the RAGAS synthetic dataset, were given to the RAGAS evaluator via the evaluate function. The evaluator LLM in this step was GPT-3.5 Turbo and the embedding model was text-embedding-ada-002. It is worth mentioning that some of the ground truths and generated answers were also manually checked to make sure the RAGAS evaluator was working properly.

3.5 Graphical User Interface
A graphical user interface was built using the Tkinter library, which is a GUI toolkit. The GUI consists of three steps.
1. First, the user has the option to create a new chat or resume a previous chat.
2. If the new chat option is chosen, the next step is to choose the LLM, which can be either Zephyr-7B-β or GPT-3.5 Turbo. After the LLM is loaded, the user can start interacting with it. The GUI receives the input and then streams the generated responses using the stream_chat function. Another feature of the GUI is that upon uploading a new document, it splits the document into chunks of 300 characters, creates vector embeddings, and adds those to the existing vector indices. After that, the GUI gives a short summary of the added document, demonstrating the model's ability to generate automatic insights.
3. If, on the other hand, the previous chat option is chosen, the previous chats are shown and the user can choose which one to resume, from the folder in which the previous chats are stored. The GUI then continues answering new questions using the LLM previously used in the selected chat.
A demonstration of the GUI can be seen in Figure 3.3.
Figure 3.3: GUI

4 Results
In this chapter, the evaluation results will be presented and discussed. It is important to note that the model was deployed on a system equipped with an Intel(R) Core(TM) i7-10850H central processing unit (CPU) featuring 6 cores and 64 GB of random-access memory (RAM). With this hardware configuration, the average response speed of Zephyr-7B-β was 1.26 tokens per second, which is roughly one word per second.

4.1 Related questions
Results for the 100 related questions are discussed in this section. Two different approaches have been used for interpreting these results. The first approach is taken to determine which types of questions are more manageable for each LLM based on the different evaluation metrics, which allows the user to pick the LLM based on the preferred question type. The second approach is taken to get an overview of the overall performance of each LLM based on the evaluation metrics only, and to be able to compare the overall performance of the two models.

4.1.1 Results with respect to evolution type
The results based on each evolution type for related questions are discussed in this part.
• The faithfulness score for each evolution type is shown in Figure 4.1. It can be seen that for the reasoning evolution type GPT-3.5 Turbo outperforms Zephyr-7B-β, while having almost the same performance for the other two evolution types. This suggests that when there is a need for the LLM to perform some reasoning to reach an answer, GPT-3.5 Turbo might give more factual answers.
• The answer relevancy score for each evolution type is shown in Figure 4.2. It can be seen that for all three evolution types, Zephyr-7B-β outperforms GPT-3.5 Turbo, indicating its strength in generating related answers regardless of the question type.
• The context precision score for each evolution type is shown in Figure 4.3. It can be seen that Zephyr-7B-β outperforms GPT-3.5 Turbo in simple evolution types, while having the same performance in the other two evolution types. Therefore, when using the LLM for simple question types, Zephyr-7B-β is a slightly better option in terms of ranking relevant contexts appropriately high.
• The context recall score for each evolution type is shown in Figure 4.4. It can be seen that GPT-3.5 Turbo outperforms Zephyr-7B-β in simple evolution types, while having the same performance in the other two evolution types. Hence, in simple question types, GPT-3.5 Turbo seems to be a slightly better option for retrieving all the relevant contexts with respect to the ground truth.
• The answer semantic similarity score for each evolution type is shown in Figure 4.5. It can be seen that in all three evolution types, GPT-3.5 Turbo slightly outperforms Zephyr-7B-β, showing its competitiveness in generating answers that are semantically similar to the ground truth.
• The answer correctness score for each evolution type is shown in Figure 4.6. It can be seen that in reasoning evolution types, Zephyr-7B-β outperforms GPT-3.5 Turbo, while having a lower score in simple and multi-context evolution types. Therefore, when focusing on reasoning question types, Zephyr-7B-β is the better option for achieving more accurate answers with respect to the ground truth.

4.1.2 Overall results for each metric
The overall results for related questions based on each metric are shown in Figure 4.7.
• It can be seen that GPT-3.5 Turbo outperforms Zephyr-7B-β in faithfulness, suggesting that it gives more factual answers to the given questions.
• In terms of answer relevancy, Zephyr-7B-β seems to have considerably better performance than GPT-3.5 Turbo, indicating its ability to directly and appropriately address the original question.
• It can be seen that Zephyr-7B-β slightly outperforms GPT-3.5 Turbo in terms of context precision, while having a slightly lower score in terms of context recall. Overall, both models perform well on these two metrics, which indicates that the RAG pipeline provides high precision in ranking relevant items in the context and retrieving the most aligned ones.
• In terms of answer semantic similarity, GPT-3.5 Turbo slightly outperforms Zephyr-7B-β. However, both models have high scores for this metric, which indicates their ability to perform semantic search.
• Zephyr-7B-β outperforms GPT-3.5 Turbo in answer correctness, which can indicate that it is more accurate in generating an answer that has factual and semantic similarity with the ground truth.

4.2 Unrelated questions
The overall results for the 5 unrelated questions are shown in Figure 4.8. It can be seen that both models perform equally well in terms of faithfulness and answer relevancy. The context precision and context recall metrics are equal to zero for both models. This is due to the fact that these questions are unrelated to the in-house documents, and therefore there is no related context in the dataset that can be retrieved by the model. In terms of answer semantic similarity, it can be seen that GPT-3.5 Turbo slightly outperforms Zephyr-7B-β, while having a slightly lower score in answer correctness. This indicates that GPT-3.5 Turbo excels in conveying the general meaning of each answer, while Zephyr-7B-β is better at generating precise and detailed answers.

Figure 4.1: Faithfulness
Figure 4.2: Answer relevancy
Figure 4.3: Context precision
Figure 4.4: Context recall
Figure 4.5: Answer semantic similarity
Figure 4.6: Answer correctness
Figure 4.7: Overall results for the related questions
Figure 4.8: Overall results for the unrelated questions

5 Conclusion
In this chapter, the research questions are answered based on the obtained results.
1. The first research question was to see if current state-of-the-art LLM technologies can be used to improve the research workflows in the automotive safety domain by enhancing productivity or quality of work. According to the evaluation results and the high scores in metrics such as faithfulness, answer relevancy and answer correctness, it can be seen that choosing the right LLM along with proper prompting and modifications can lead to factual and accurate answers for each question. It can be seen in the results that even for complex question types, such as multi-context ones that require the model to retrieve the relevant context from multiple sources and documents, the metric scores are still high. This shows the reliability of such a question answering system and can yield promising performance in enhancing productivity in domain-specific research. By using LLMs in a correct manner, one is able to search efficiently among documents, draw effective conclusions for each question, and extract important information from one or multiple documents. This can significantly boost the productivity and quality of research workflows.
Getting quick summaries from long documents to save time, drawing main takeaways from a study, and generating new ideas based on existing research are just a few examples of productivity applications of these models.
2. The second research question was to check whether exploiting zero/few-shot capabilities of LLMs, for example through RAG, is a viable approach. According to the results, it can be seen that the developed RAG pipeline has achieved high scores in retrieval-related metrics, such as answer relevancy, context precision and context recall. This indicates that the RAG approach is successful in retrieving the relevant information and therefore eliminates the need for fine-tuning the model in order to get accurate answers. The RAG pipeline's ability to retrieve relevant information ensures that the generated responses are based on accurate and up-to-date information. It can be seen that instead of adjusting the model's parameters on the in-house dataset, the model can rely on dynamically fetched information. This approach ensures that the responses are not only accurate but also contextually relevant, reflecting the most recent and relevant data available. The RAG approach allows the model to handle a wide range of queries by using diverse data sources. This flexibility means the model can dynamically adapt to different topics without requiring specialized training for each one, which makes it a viable approach to take when using LLMs for domain-specific data.
3. The last research question was to examine whether open-source LLMs provide competitive performance compared to proprietary LLM models. As seen in the results section, Zephyr-7B-β showed competitive performance compared to GPT-3.5 Turbo. It can be concluded that by choosing the right open-source model, we can obtain results as good as those generated by a proprietary one. This enables users to leverage the power of LLMs in their research process without having to pay for the licence fees and API keys required by proprietary models like GPT-3.5 Turbo. Also, open-source models are more transparent, since their source code is accessible to the public. This gives users the opportunity to explore the model's architecture and its weights in order to customize it towards their specific tasks. Moreover, open-source models come with the privacy benefits of local deployment compared to the alternative closed-source commercial solutions, which can make them a better option for companies using sensitive or private data, allowing them to run the model on-device with satisfactory runtime performance.

6 Future Work
The main constraint faced during this project was the limited amount of time available, since running LLMs is a time-consuming process, which prevented us from exploring all the possible details. In this chapter, various areas where improvements can be made in future research are discussed.
1. One possible modification that may improve the performance of the current model is to fine-tune it on the in-house dataset. Although the RAG pipeline has been shown to be an effective approach for using LLMs on in-house datasets with lower initial cost, there are still some benefits to fine-tuning, such as its higher accuracy in giving precise outputs [13]. Therefore, mixing fine-tuning and the RAG method can be an interesting approach to take.
2. Another area of enhancement is to experiment more with the LLM's parameters.
One of the main benefits of using open-source LLMs is their transparency, which gives everyone the opportunity to tune the model's parameters in their own way. Hence, given enough time, it would be beneficial to make use of this opportunity and experiment more with the parameters to monitor how they can lead to better results.
3. Another thing to consider for future research would be to use vector databases as the vector store in the RAG pipeline. In this research, the vector embeddings have been stored on the local computer, which made the process very time-consuming. By using vector databases as an alternative, one would be able to decrease the processing time and improve the semantic search ability [27]. Thus, it would be a good idea to try them out and see how they affect the results of this project.
4. Another enhancement could be to scale up the synthetic dataset size to get a more accurate picture of the performance.
5. Lastly, instead of using GPT-3.5 Turbo as the proprietary model for comparison, more advanced models can be used, such as GPT-4 and GPT-4o. This will allow one to get a better sense of Zephyr-7B-β's competence compared to proprietary models. Moreover, as the critic model used in the evaluation part, more advanced models such as GPT-4 or GPT-4o can be used for getting more accurate evaluation results.

Bibliography
[1] E. M. Bender, T. Gebru, A. McMillan-Major, and S. Shmitchell, “On the dangers of stochastic parrots: Can language models be too big?” in Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 2021, pp. 610–623.
[2] N. T. Heffernan and C. L. Heffernan, “The assistments ecosystem: Building a platform that brings scientists and teachers together for minimally invasive research on human learning and teaching,” International Journal of Artificial Intelligence in Education, vol. 24, no. 4, pp. 470–497, 2014.
[3] H. Naveed, A. U. Khan, S. Qiu, M. Saqib, S. Anwar, M. Usman, N. Akhtar, N. Barnes, and A. Mian, “A comprehensive overview of large language models,” arXiv.org, 2023. [Online]. Available: https://arxiv.org/abs/2307.06435
[4] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” arXiv.org, 2017. [Online]. Available: https://arxiv.org/abs/1706.03762
[5] Z. Li, X. Zhang, Y. Zhang, D. Long, P. Xie, and M. Zhang, “Towards general text embeddings with multi-stage contrastive learning,” arXiv.org, 2023. [Online]. Available: https://arxiv.org/abs/2308.03281
[6] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” 2013. [Online]. Available: https://arxiv.org/abs/1301.3781
[7] J. Devlin, M. W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint, 2018.
[8] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language models are unsupervised multitask learners,” 2019.
[9] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” in Advances in Neural Information Processing Systems, vol. 33, 2020, pp. 1877–1901.
[10] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” Journal of Machine Learning Research, vol. 21, no. 140, pp.
1–67, 2020.
[11] P. Sahoo, A. K. Singh, S. Saha, V. Jain, S. Mondal, and A. Chadha, “A systematic survey of prompt engineering in large language models: Techniques and applications,” arXiv.org, 2024. [Online]. Available: https://arxiv.org/abs/2402.07927
[12] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” arXiv.org, 2020. [Online]. Available: https://arxiv.org/abs/2005.14165
[13] A. Balaguer, V. Benara, R. L. De Freitas Cunha, R. De M Estevão Filho, T. Hendry, D. Holstein, J. Marsman, N. Mecklenburg, S. Malvar, L. O. Nunes et al., “Rag vs fine-tuning: Pipelines, tradeoffs, and a case study on agriculture,” arXiv.org, 2024. [Online]. Available: https://arxiv.org/abs/2401.08406
[14] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. Yih, T. Rocktäschel et al., “Retrieval-augmented generation for knowledge-intensive nlp tasks,” arXiv.org, 2020. [Online]. Available: https://arxiv.org/abs/2005.11401
[15] Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, M. Wang, and H. Wang, “Retrieval-augmented generation for large language models: A survey,” 2024. [Online]. Available: https://arxiv.org/abs/2312.10997
[16] S. Es, J. James, L. Espinosa-Anke, and S. Schockaert, “Ragas: Automated evaluation of retrieval augmented generation,” arXiv.org, 2023. [Online]. Available: https://arxiv.org/abs/2309.15217
[17] L. Tunstall, E. Beeching, N. Lambert, N. Rajani, K. Rasul, Y. Belkada, S. Huang, V. W. Leandro, C. Fourrier, N. Habib et al., “Zephyr: Direct distillation of LM alignment,” arXiv (Cornell University), 2023.
[18] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. De Las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier et al., “Mistral 7b,” arXiv.org, 2023. [Online]. Available: https://arxiv.org/abs/2310.06825
[19] L. Zheng, W. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing et al., “Judging llm-as-a-judge with mt-bench and chatbot arena,” arXiv.org, 2023. [Online]. Available: https://arxiv.org/abs/2306.05685
[20] Tatsu-Lab. GitHub - tatsu-lab/alpaca-eval: An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast. [Online]. Available: https://github.com/tatsu-lab/alpaca_eval
[21] A. Holtzman, J. Buys, L. Du, M. Forbes, and Y. Choi, “The curious case of neural text degeneration,” 2020. [Online]. Available: https://arxiv.org/abs/1904.09751
[22] S. Xiao, Z. Liu, P. Zhang, and N. Muennighoff, “C-pack: Packaged resources to advance general chinese embedding,” 2023.
[23] N. Muennighoff, N. Tazi, L. Magne, and N. Reimers, “Mteb: Massive text embedding benchmark,” arXiv.org, 2022. [Online]. Available: https://arxiv.org/abs/2210.07316
[24] J. Liu. Llamaindex. [Online]. Available: https://github.com/jerryjliu/llama_index
[25] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong et al., “A survey of large language models,” arXiv.org, 2023. [Online]. Available: https://arxiv.org/abs/2303.18223
[26] C. Xu, Q. Sun, K. Zheng, X. Geng, P. Zhao, J. Feng, C. Tao, and D.
Jiang, “WizardLM: Empowering large language models to follow complex instructions,” arXiv.org, 2023. [Online]. Available: https://arxiv.org/abs/2304.12244
[27] Z. Jing, Y. Su, Y. Han, B. Yuan, H. Xu, C. Liu, K. Chen, and M. Zhang, “When large language models meet vector databases: A survey,” arXiv.org, 2024. [Online]. Available: https://arxiv.org/abs/2402.01763