Autonomous Drug Design with Reinforcement Learning
Automatization and decision making within a machine learning framework for drug discovery
Master's thesis in Computer Science and Engineering

Filip Edvinsson
Victor Jonsson

Department of Computer Science and Engineering
CHALMERS UNIVERSITY OF TECHNOLOGY
UNIVERSITY OF GOTHENBURG
Gothenburg, Sweden 2023

Master's Thesis 2023

© Filip Edvinsson, Victor Jonsson, 2023.

Supervisor: Morteza Haghir Chehreghani, Department of Computer Science and Engineering, Chalmers University of Technology
Advisor: Hampus Gummesson Svensson, AstraZeneca and Department of Computer Science and Engineering, Chalmers University of Technology
Examiner: Alexander Schliep, Department of Data Science and AI, University of Gothenburg

Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Typeset in LaTeX
Gothenburg, Sweden 2023

Abstract

The drug design process is currently one of manual trial and error, where potential drug candidates are proposed by chemists, synthesized in laboratories, and then tested and analyzed for properties and efficacy. This process, also called the Design-Make-Test-Analyze (DMTA) cycle, is repeated until a satisfying drug candidate is reached. Statistical models that sample the chemical space and generate potential molecules, combined with automated laboratories and machine learning, allow for the automatization of the DMTA-cycle. However, there is still a need for improvement, and this is where our project comes in.

One way to improve the automatization of the DMTA-cycle is to reduce the number of cycles needed, and our aim was to achieve this by improving the selection of compounds. To do this, we developed two deep reinforcement learning algorithms, Deep Q-Network (DQN) and Double Deep Q-Network (DDQN), and compared these to two baseline selection algorithms. This approach was chosen as it translates well into the drug development field. Reinforcement learning in drug discovery works by exploring the proposed molecules to find potential candidates and selecting the most promising ones based on molecular similarity to some predetermined properties.

Ultimately, the project was unsuccessful. The baseline selection algorithms, using random and greedy selection approaches, proved more efficient and accurate than the two algorithms we developed. The involvement of reinforcement learning agents when selecting compounds seemed to cloud the generative model's understanding of what constitutes a good molecule, and thereby reduced the quality of proposed molecules for both of the implemented selection algorithms.
However, we found that the DQN algorithm shows some signs of promise and can, with some fine-tuning, potentially be brought up to par with the baseline selection algorithms, and perhaps even surpass them.

Keywords: drug discovery, drug design, design-make-test-analyze cycle, DMTA-cycle, machine learning, deep reinforcement learning, deep Q-learning

Acknowledgements

We would like to give a special thanks to AstraZeneca for providing us with the equipment and computational resources needed to complete this project, and for allowing us to use their on-site facilities in Mölndal.

Further, the biggest of thanks goes to Hampus Gummesson Svensson for your unwavering support and encouragement. Without you, none of this would have been possible.

We would also like to extend a big thank you to Morteza Haghir Chehreghani for always making sure we were on track and fulfilling the academic purposes of this project.

Lastly, thank you to Alexander Schliep for all your constructive feedback, and especially for your understanding when we ran into some issues early in the project.

Filip Edvinsson, Victor Jonsson, Gothenburg, January 2023

Contents

List of Figures
List of Tables
1 Introduction
  1.1 Background
  1.2 DMTA-cycle
    1.2.1 Design
    1.2.2 Make
    1.2.3 Test
    1.2.4 Analyze
  1.3 Selection
  1.4 Aim
  1.5 Ethical Considerations
2 Theory
  2.1 Reinforcement Learning
    2.1.1 Exploration & Exploitation
    2.1.2 Q-learning
    2.1.3 Deep Q-learning
    2.1.4 Optimizers
  2.2 Molecular Encoding
    2.2.1 SMILES
    2.2.2 Molecular Fingerprints
    2.2.3 Fingerprint Types
  2.3 Molecular Novelty
  2.4 QSAR
  2.5 REINVENT
    2.5.1 Overview
    2.5.2 Diversity Filter
    2.5.3 Scoring Functions
    2.5.4 Reinforcement Learning in REINVENT
3 Methods
  3.1 DMTA-cycle in-silico: Framework
    3.1.1 Design
    3.1.2 Selection
    3.1.3 Make
    3.1.4 Test
    3.1.5 Analyze
  3.2 Reinforcement Learning
    3.2.1 General
    3.2.2 State Representation
    3.2.3 Explore/Exploit Function
    3.2.4 Reward Setup
    3.2.5 Molecule Selection
    3.2.6 Experience Replay
4 Results
  4.1 Settings
  4.2 Baseline
    4.2.1 Baseline Results
    4.2.2 Baseline Observations
  4.3 Deep Q-Learning Selectors
    4.3.1 Deep Q-Learning Selectors Results
    4.3.2 Deep Q-Learning Selectors Discussions
  4.4 Comparisons and Discussions
5 Conclusions

List of Figures

1.1 DMTA-cycle visualization.
1.2 Example of traversal in chemical space.
1.3 Example of the difference between small and large molecules.
2.1 Overview of the reinforcement learning cycle.
2.2 Illustration of the difference between Q-learning and deep Q-learning.
2.3 Chemical structure of toluene.
2.4 Example of a 10-bit substructure fingerprint.
2.5 The REINVENT reinforcement learning loop.
3.1 DMTA-cycle with selection step, visualization.
4.1 Greedy and random selectors with novelty on.
4.2 Greedy and random selectors with novelty off.
4.3 Greedy and random selectors where they select from 10000 compounds.
4.4 DQN and DDQN selectors with novelty on.
4.5 DQN and DDQN selectors with novelty off.
4.6 DQN and DDQN selectors where they select from 10000 compounds.
4.7 Three individual runs using the DQN selector.
4.8 Three individual runs using the DDQN selector.
4.9 All selectors with novelty on.
4.10 All selectors with novelty off.
4.11 All selectors where they select from 10000 compounds.

List of Tables

2.1 Example of a Q-table.
3.1 REINVENT settings used.
3.2 Settings used for RL in the selection step.

1 Introduction

1.1 Background

The drug design process within the pharmaceutical industry is, both historically and presently, one of trial and error, and for which a Design-Make-Test-Analyze (DMTA) cycle can be used as a development approach [1].
In the context of drug discovery, the design-step means proposing one or several molecules, which are then selected in the selection process to be synthesized in the make-step. Next, properties and efficacy are tested in a series of biological assays, after which results are analyzed and improvements made before starting the next iteration. This cycle is repeated until reaching a satisfying drug candidate, called a lead, that can safely be used for its intended purpose. This lead will then go through optimization and, further down the line, clinical trials resulting in an eventual end product [2]. While there have been some improvements to the DMTA-cycle's efficiency over time, the procedure is still both costly and time consuming. Costs are often estimated between $500 million and $2 billion, with an estimated average time from the start of clinical testing to regulatory approval of 7.2 years [3–6].

Two newly emerged paradigms within the pharmaceutical industry show great promise in speeding up the drug design process: AI-augmented molecular design and automation of the DMTA-cycle. AI-augmented molecular design uses generative models to sample the chemical space, containing upwards of 10^60 drug-like molecules, in order to propose possible molecules [7, 8]. A generative model is a statistical model of the joint probability distribution of some given data, and can generate new data instances by utilizing this distribution. Speeding up the DMTA-cycle through automation uses a combination of automated laboratories and machine learning to design, make, test and analyze molecules without the need for human action [9, 10].

This project will primarily focus on the development of a framework for automation of the DMTA-cycle through use of reinforcement learning (RL). RL is an area of machine learning where desirable behaviors are rewarded and undesirable ones are punished. In general, a reinforcement learning agent learns useful actions through trial and error by receiving rewards when interacting with its environment. Applying this concept to drug design, the agent could be rewarded for selecting a promising molecule and punished for selecting a less desirable one. Before going more in-depth with reinforcement learning, let's take a closer look at what happens during the individual stages of the DMTA-cycle, visualized in Figure 1.1, and how they are affected when mixed with AI-augmented molecular design and automation.

Figure 1.1: A visualization of the Design-Make-Test-Analyze (DMTA) cycle.

AI-augmented molecular design is used to generate molecules in the design-step. The most promising ones are then synthesized (in silico, i.e., by computer simulation) in the make-step, which will be simulated in a simplified scenario where we assume everything can be synthesized and has equal cost. The aim of this project is to speed up the DMTA-cycle by improving the selection of molecules passed to the make-step. In the test-step, the suggested molecules are evaluated based on what is believed to constitute an efficient compound. This data is then fed to the analyze-step, where the newly acquired data is used to update the generative model with the hope of steering it towards a promising area of the chemical space. Hopefully, the autonomous drug design framework will minimize the required number of DMTA-cycles and thereby enable a more efficient way of finding drug candidates, ultimately providing new drugs for unmet medical needs faster, to the benefit of patients worldwide.
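To make the overall flow concrete, the following minimal sketch strings the steps of the simulated cycle together in code. All function names and bodies are hypothetical stand-ins used purely for illustration; they are not the framework's actual components or API.

import random
from typing import List, Tuple

# Hypothetical stand-ins for the real framework components.
def design(n: int) -> List[str]:
    return [f"MOL{i}" for i in range(n)]              # generative model proposes molecules

def select(cands: List[str], k: int) -> List[str]:
    return random.sample(cands, k)                     # placeholder: random selection

def make(chosen: List[str]) -> List[str]:
    return chosen                                      # simulated synthesis, 100% success

def test(made: List[str]) -> List[float]:
    return [random.choice([0.0, 1.0]) for _ in made]   # placeholder binary ground-truth score

def analyze(history: List[Tuple[str, float]]) -> None:
    pass                                               # would update the generative model / selector

def run_dmta(n_cycles: int, n_design: int, n_select: int) -> None:
    history: List[Tuple[str, float]] = []
    for _ in range(n_cycles):
        candidates = design(n_design)
        chosen = select(candidates, n_select)
        made = make(chosen)
        scores = test(made)
        history.extend(zip(made, scores))
        analyze(history)

run_dmta(n_cycles=3, n_design=1000, n_select=100)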
1.2 DMTA-cycle

Before properly introducing the selection process, it is worthwhile to go into more detail on what happens during the different stages of the DMTA-cycle.

1.2.1 Design

Traditionally, the design-step consists of chemists manually designing and selecting compounds that seem to be of interest as potential candidates for a given purpose [10, 11]. The idea behind an in-silico creation stems from the recent advancements made in computational speed and machine learning, allowing computer models to generate large amounts of potential molecules. These models are called generative models [12, 13]. A generative model is a statistical model of a joint probability distribution of a given data set that can generate new data instances [14].

The chemical space is vast, and although virtual screening libraries are becoming enormous, they correspond to a minuscule proportion of chemical space. By generating compounds in a directed manner using de novo design, computational practitioners hope to traverse chemical space more effectively, reaching optimal chemical solutions while considering fewer molecules than allowed by brute-force screening of large chemical libraries [7]. Figure 1.2 is an example, showing the difference in traversal of chemical space between virtual screening of large pre-existing chemical libraries and an effective de novo molecular program. The color of the plane shows the quality of the molecular property landscape, with yellow being poor and purple being good. A dot signifies a molecule being considered.

Figure 1.2: A visual example of what the difference in traversal of chemical space between virtual screening of large pre-existing chemical libraries (a) and by an effective de novo molecular program (b) could look like.

The process of automatically proposing novel chemical structures to optimally satisfy a desired molecular profile is known as de novo molecular design; due to the increased use of generative models in AI, it has recently also been called generative chemistry. De novo design is often used in drug discovery to create a molecule that will provoke a desirable biological response while also possessing acceptable pharmacokinetic properties [7]. REINVENT is a well-performing example [15] of a reinforcement learning AI tool for de novo drug design which utilizes generative models to create and present a large number of compounds [16].

1.2.2 Make

The make-step is where the molecules suggested in the design-step are synthesized in laboratories, a process requiring expertise, equipment, time and money [4]. Still, an issue lies with the uncertainty of efficacy in the created compounds. There might be a large number of proposed compounds that one could want to test, so in order to reduce time and costs, a few compounds are usually tested in a batch and iterated across several steps [17]. The knowledge and data obtained from these iterative small-batch tests are then used to improve the process in the following DMTA-cycles.

Simulation can hopefully improve the make-step by reducing the effort wasted on creating ultimately unsuccessful drug candidates. Simulating the drug design process can hopefully result in a better understanding of which molecules should later be synthesized in a real laboratory [10].
In this work, the most promising molecules are simulated in a simplified scenario where we assume all molecules can be synthesized and have an equal cost. It could be useful to consider the molecules' sizes in order to better simulate which molecules show enough promise to be considered for real-life testing. Small molecule drugs are synthetic chemicals that mimic, or are inspired by, natural products. These drugs usually target specific biochemical processes to diagnose, treat, or prevent diseases. In contrast, biologics are molecules that are derived from living organisms. These are often proteins similar to human proteins, making them very effective at treating various conditions [18].

Figure 1.3 is a visual example of the differences between some regular compounds and a polymer, showing the difference between small and large molecules. A polymer is a compound consisting of very large molecules, composed of many repeating subunits. The large molecule in Figure 1.3 is a biopolymer, which is a natural polymer produced by the cells of living organisms. To be more specific, this biopolymer is a polypeptide, which consists of chains of amino acids: organic compounds that contain amino and carboxylic acid functional groups.

Figure 1.3: Three regular compounds and an amino acid chain to visualize the difference between small and large molecules.

While we have the potential to create, and simulate, large molecules, we will focus on small molecules as they tend to have attributes and abilities generally sought after in the field of drug discovery.

1.2.3 Test

In the test-step, the compounds created in the make-step get their attributes and abilities tested. The testing of compounds is typically done through laboratory tests and experiments. The cost of this stage is highly dependent on how many compounds are tested, and subsequently on how many DMTA-cycles are needed. In a real-world setting, this step is usually costly, taking up approximately 63% of the over US$100 billion that the pharmaceutical industry spends yearly on drug development [4]. The key objective of the test-step is to deliver relevant data in a timely manner for all specific project hypotheses [11].

In terms of time investment within the DMTA-cycle, the test-step is generally the one that can be improved the most [3]. The design and analyze steps tend to be fast, while the make-step tends to be slow, often taking weeks to complete the synthesis of the molecules, assuming they are sufficiently novel and complex. The test-step can be optimized and streamlined to be fast and predictable, in particular within areas such as in vitro data generation, including potency, selectivity and profiling [3].

In the simulated test-step, created molecules are compared to the ground-truth. The ground-truth can be viewed as a set of rules that dictate the quality of any given molecule. It can also be a reference drug used for comparison. This can be as simplistic as a binary or continuous value based on the chemical composition of the molecule, or as complex as an entire mathematical model that tries to predict the physicochemical, biological and environmental properties of a molecule based on its chemical structure [19].

1.2.4 Analyze

The analyze-step is where the results and data gathered in the previous steps are used to optimize the next iteration of the DMTA-cycle. In order for the DMTA-cycle to work most efficiently, data must be rapidly turned into knowledge.
The right analysis is essential for learning and making rational decisions that lead back to the design-step and the statement of a new hypothesis. The analyze-step is thus an essential and key part of the DMTA-cycle [11].

The information from the test-step provides valuable insight into which molecules worked well and which did not. This allows for more informed decisions in the next iteration's design-step. If the DMTA-cycle is performed in-silico, this means feeding the generative model with this information in the hope that it will output better molecules in the future.

1.3 Selection

The selection process is a sub-step within the DMTA-cycle, taking place in between the design- and make-steps. During this stage, the selection is made of which molecules obtained from the design-step are to be brought forward to the make-step. In the past, this was usually done by identifying molecules similar to those already tested and having shown promise. Due to both time and cost limitations, only a small number of molecules could be selected this way, and there was therefore a large reliance on previous test data [4, 20]. Most of the disadvantages from the past still apply today, albeit they usually surface later in the drug discovery process.

Today, we have the ability to simulate this stage. This can be done in multiple ways, with varying complexity. In the most basic form, an algorithm simply selects molecules to bring forward at random. In a more advanced version, an algorithm could rank data from the test-step in the previous DMTA-cycle and select molecules based on that data. An even more advanced selection algorithm could utilize machine learning, letting a machine learning agent learn which molecules to select based on results from previous choices and test results. This is the part we have decided to make the major focal point of our project. We plan to implement two new selectors based on reinforcement learning (RL).

1.4 Aim

The goal of the project is to improve the automation of the DMTA-cycle. In more detail, the goal is to improve the selection of compounds proposed by the design-step, which are later fed to the make-step, through the use of reinforcement learning. Even more specifically, we aim to implement selectors based on Deep Q-Network (DQN) and Double Deep Q-Network (DDQN), both utilizing deep Q-learning to improve on the basic form of RL (these concepts are introduced and explained in detail in Section 2.1). In particular, we will investigate the following questions:

• Is the DQN and/or the DDQN selection method an improvement over the baseline methods?
• How do the implemented methods fare in terms of speed when it comes to producing a sufficient result?
• Does the number of selected compounds have an effect on which selection method performs better?
• Will a stricter classification of what a novel compound entails affect which selection method performs better?

1.5 Ethical Considerations

The main goal of this project is to improve upon the automatization and decision-making process of a machine learning framework used for drug discovery at AstraZeneca. The framework is still at an early enough stage that it is unlikely that our contributions will affect drugs that reach consumers in the near future. The framework is also not intended to replace the entire development process, but rather speed it up.
Before any drugs reach the market, they would have to undergo rigorous testing, analysis and clinical trials, all greatly reducing the risk of us suggesting drugs with adverse side effects.

Another ethical consideration worth noting is that this creation might not only be used for its original stated intention. Here, the aim is to improve and speed up the drug development process by automatization. A fully automated system for drug discovery could theoretically be used without any chemical knowledge. If these improvements were to fall into the hands of a malicious user, they could instead be used to facilitate the development of hazardous chemicals that could potentially be used as chemical weapons. This is a case of the dual-use dilemma and is not something we wish to see occur, nor something we deem likely to occur. We intend to make sure the contributions made by this project follow the Ethics Guidelines for Trustworthy AI, published by the European Commission in 2019, in order to lessen the chance of this happening [21]. Trustworthy AI has three components: it should be lawful, ethical and robust. The lawful component concerns whether the AI complies with all applicable laws and regulations, and the ethical component ensures adherence to ethical principles and values. Finally, the robust component, which has both technical and social perspectives, tries to ensure that the previous two are complied with, since AI, even with good intentions, can cause unintentional harm [21]. A sincere attempt to follow these guidelines should form enough of a barrier to prevent the contributions of this project from ending up being misused.

Then there are also the usual considerations when working with ML models, where bias can occur if the data set is not collected correctly and with care. From previous work, we have some experience dealing with this issue. This, combined with the fact that we are aiming to follow the Ethics Guidelines for Trustworthy AI, should hopefully be enough to minimize the risk of this being an issue.

2 Theory

2.1 Reinforcement Learning

The goal of the autonomous drug design framework is to minimize the required number of DMTA-cycles needed to find drug candidates. Several generative models for de novo molecular design exist, capable of proposing large numbers of molecules based on some initial criteria, one being REINVENT, introduced in Section 2.5 [16, 22]. Given the large number of small molecules these models are capable of producing, it is highly important to effectively select the most promising ones, and machine learning has shown great promise in this area [23]. As utilizing machine learning for automatization and decision making is a fairly novel concept in the pharmaceutical field, there is a lot of room for exploration with regard to the choice of models [24]. We are looking to investigate the use of reinforcement learning (RL) for drug design.

RL is a machine learning technique comprising an agent and an environment, as seen in Figure 2.1. The agent is the learner and decision maker, and the environment is everything outside the agent itself, with different possible configurations of the environment referred to as states. As seen in Figure 2.1, the agent and environment interact at discrete time-steps, t = 0, 1, 2, ... (RL also supports continuous-time cases, but these are omitted as they are not relevant for this application [25]). With every time-step, the environment presents the agent with a state, S_t, and the agent selects an action, A_t, based on its current knowledge of the given state.
The mapping from states to the probability of selecting each possible action in the given state is called a policy, often denoted by π.

Figure 2.1: Overview of the reinforcement learning cycle [26].

At the next time-step, t + 1, the agent receives a reward, R_{t+1}, from some unknown distribution based on the action, A_t, selected in the previous state, S_t [25]. The reward is a numeric value that signifies whether the action taken was positive or negative. The sum of all rewards is known as the return, G_t = R_{t+1} + R_{t+2} + ... + R_T, with T being the final time-step. It is important to note that a reward may not be a direct result of the agent's latest action, as it may have been a consequence of a series of actions taken previously [27]. Further, all rewards are not necessarily valued equally. A discount factor, γ, can be used to determine how much the reinforcement learning agent values immediate versus future rewards. The ultimate goal of the agent is to find the optimal policy, i.e., the one maximizing the cumulative reward [28]. Sutton and Barto describe RL as "learning what to do - how to map situations to actions - so as to maximize a numerical reward signal" [25]. They are careful to note how actions may not only affect immediate rewards, but also the situation itself, thereby changing subsequent rewards [25].

2.1.1 Exploration & Exploitation

As the reinforcement learning algorithm starts off with no knowledge of its environment, exploration is a very important aspect of every RL implementation. Exploration is achieved by selecting random actions to gain information. Every state-action pair has what is called a Q-value, Q(s, a) (also known as an action value or state-action value), which is a measure of the expected reward when the agent is in state s and selects action a. Q_π(s, a) is the expected return when starting in state s, selecting action a, and thereafter following policy π, as seen in Equation 2.1:

Q_π(s, a) = E_π{ G_t | s_t = s, a_t = a }    (2.1)

Once the algorithm has a better understanding of which actions are beneficial in certain states, it can start using this information to make educated choices on which action to select at a given time. This is called exploitation. The balancing of exploration and exploitation is vital both in terms of model accuracy and efficiency. If the model does not explore enough, it might not find the most suitable action, which in drug discovery would be the most promising molecules. If the model explores too much, it might be too slow and therefore unusable under time constraints. A common approach to this balancing act is called ε-greedy, showcased in Equation 2.2. An ε-greedy exploration algorithm takes a uniformly random action with probability ε and selects the best action given the current information with probability 1 − ε [25, 28], that is

a_t = argmax_a Q_t(s, a)                                                     with probability 1 − ε
a_t = a random action, picked uniformly from the set of possible actions A_t, with probability ε    (2.2)

Exploration becomes less important as the model learns more about its environment, and it is therefore common to use a decaying ε to shift emphasis towards exploitation. Otherwise, the actions taken will be too random and the behavioural policy risks not gaining enough experience with states and actions near the optimal policy. Furthermore, ε has to decay in order for the policy to converge to the optimal policy [28, 29].
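As a concrete illustration of the ε-greedy rule in Equation 2.2 and the decaying ε discussed above, the short sketch below selects an action index from a vector of Q-values and then decays ε. It is a generic illustration of the concept, not code from the framework described later; the variable names are our own.

import random

def epsilon_greedy(q_values, epsilon):
    """Return an action index: random with probability epsilon, greedy otherwise."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                       # explore: uniform random action
    return max(range(len(q_values)), key=lambda a: q_values[a])      # exploit: highest Q-value

# Decaying epsilon: shift from exploration towards exploitation over time.
epsilon, decay, eps_min = 1.0, 0.99, 0.02
q_values = [0.1, 0.5, 0.2]
for step in range(10):
    action = epsilon_greedy(q_values, epsilon)
    epsilon = max(eps_min, epsilon * decay)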
Low values of ε allow for deeper exploration, namely exploration that not only considers the immediate information gain but also the consequences of an action for future learning [30]. A high ε, on the other hand, helps prevent overspecialization [31].

2.1.2 Q-learning

Q-learning is a model-free, off-policy reinforcement learning algorithm with the purpose of learning the value of an action in a given state [25]. Model-free means that the algorithm does not rely on a model of the environment's dynamics, but instead learns directly from the transitions it experiences. Off-policy means that the Q-values are updated using the next state S′ and the greedy action a′, essentially acting as if the algorithm would follow a greedy policy, even though the actions actually taken may differ from it, for example by being chosen at random [25]. Q-learning is an abbreviation of Quality-learning, where Quality signifies the quality of a specific action given a specified state. In practice, it works by initializing a table with n rows and m columns as seen in Table 2.1, representing each possible state and action, respectively.

        a_0    a_1    ...    a_m
s_0     0.8    0.2    ...    0.2
s_1     0.3    0.4    ...    0.9
...     ...    ...    ...    0.0
s_n     0.1    0.8    0.4    0.6

Table 2.1: Example of a Q-table with n states and m actions.

Every state-action pair is given a certain value signifying how beneficial it is to take an action in a specific state. The Q-table is updated, as per Equation 2.3, as the algorithm learns more about the environment by either exploring or exploiting:

Q_new(s_t, a_t) ← Q(s_t, a_t) + α · [ r_t + γ · max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]    (2.3)

where Q(s_t, a_t) is the old value, α the learning rate, r_t the reward, γ the discount factor, and max_a Q(s_{t+1}, a) the estimate of the optimal future value.

In order to determine how much the model values newly gained information compared to the old, it makes use of a learning rate, α. This is a parameter that controls how much the model should change when presented with the estimated error r_t + γ · max_a Q(s_{t+1}, a) − Q(s_t, a_t) [25]. The algorithm also makes use of the discount factor, γ, to balance immediate and future rewards. A high discount factor means future rewards are highly valued, while a low discount factor values immediate rewards higher, as γ scales the term max_a Q(s_{t+1}, a) in Equation 2.3. Both these concepts are used in Equation 2.3 when updating the Q-table. Additionally, γ < 1 ensures the return is finite [25, 27].
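To make Equation 2.3 concrete, the following sketch performs one tabular Q-learning update on dummy data. It is a generic, textbook-style illustration under assumed values of α, γ and ε, not part of the thesis framework.

import random
from collections import defaultdict

# Minimal tabular Q-learning update implementing Equation 2.3.
# The environment is a stub; in practice s, a, r and s_next come from real interaction.
alpha, gamma, epsilon = 0.1, 0.9, 0.1
n_actions = 4
Q = defaultdict(lambda: [0.0] * n_actions)   # Q-table: state -> list of action values

def update(s, a, r, s_next):
    """Move Q(s, a) towards the bootstrapped target r + gamma * max_a' Q(s', a')."""
    td_error = r + gamma * max(Q[s_next]) - Q[s][a]
    Q[s][a] += alpha * td_error

# Example: one epsilon-greedy step on dummy data.
s, s_next, r = "s0", "s1", 1.0
a = random.randrange(n_actions) if random.random() < epsilon else max(range(n_actions), key=lambda i: Q[s][i])
update(s, a, r, s_next)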
2.1.3 Deep Q-learning

One issue with tabular Q-learning is that the Q-table scales with both the state and action spaces, meaning that the table can become unfeasibly large for complex systems. One solution, first successfully realized by DeepMind, is to introduce a neural network into the reinforcement learning algorithm [32]. Instead of a Q-table, a neural network is used to approximate the Q-value function. The neural network takes a state as input and outputs the Q-values of all possible actions. This is known as deep reinforcement learning (DRL) [33–35].

Figure 2.2: Illustrates the difference between Q-learning and deep Q-learning [36].

To determine the error between the observed value and the expected value, a loss function is used. The mean squared error (MSE) is often used as the loss function for deep Q-learning. It is defined as

MSE = (1/N) · Σ_{i=1}^{N} (y_i − ŷ_i)^2,

where N is the number of data points, y is the value of the observed variable, and ŷ is the expected value.

One drawback with deep Q-learning is that the loss function can only be approximated by randomly sampling the distribution. This would imply that samples should be independent and identically distributed. However, when gathering state transitions online, meaning that the algorithm uses data as soon as it is available, samples are correlated. E.g., (s_t, a_t, r_{t+1}, s_{t+1}) is directly followed by (s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}), and the reward r_{t+2} is partly a consequence of action a_t in state s_t, as explained in Section 2.1. This correlation will cause the deep neural network to overfit, as it will learn from the very specific cases it encounters, and it therefore risks converging to a local minimum. If the model was instead able to learn from disconnected samples, the knowledge acquisition would be averaged out over many previous states. This would make it more likely to find solutions to problems it has not necessarily encountered replicas of before, allowing it to possibly converge to a global optimum more often [32].

The problem with correlated samples stems from the neural network being fed transitions as they are gathered. A solution to this is utilizing a replay buffer where transitions are stored, rather than continuously fed to the network [32]. The network can then be trained on random samples of the transitions stored in this buffer, using an optimizer such as stochastic gradient descent (SGD), introduced in Section 2.1.4. The larger the replay buffer, the less likely sampled elements are to be correlated, but this also increases memory usage and may slow training [32, 37].

In some environments, DQN performs quite poorly due to its tendency to overestimate action values [38]. This occurs as it selects the maximum action value as an approximation of the expected action value [39]. One approach to counter this is Double DQN (DDQN), where two function approximators are trained on different samples and are then asked to evaluate the same action. Since they have seen different data, it is unlikely they will overestimate the same action [38].

Deep reinforcement learning has achieved many impressive results in the last decade, with one of the most prominent being the state-of-the-art Atari scores achieved by DeepMind in 2015 [40]. These results, however, required tens of thousands of episodes of experience per game [40, 41]. In real-life applications, it is important to produce results and reach conclusions quickly, as resources are almost always limited. As the ultimate purpose of this framework is to aid in speeding up autonomous drug development, we will have to optimize our approach to reach the best score within some finite time and budget constraints. These constraints could be simulated by only allowing our models to train for a minimal number of episodes and epochs. An epoch refers to one cycle through the full data available for training.
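The replay buffer idea above can be illustrated with a few lines of code. The sketch below is a generic uniform-sampling buffer, assuming transitions of the form (s, a, r, s′); it is not the buffer implementation used in this work.

import random
from collections import deque

# Minimal experience replay buffer (generic sketch, not the thesis framework's implementation).
class ReplayBuffer:
    def __init__(self, capacity: int):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped once capacity is reached

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size: int):
        # Uniform random sampling breaks the temporal correlation between transitions.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=10_000)
for t in range(100):
    buf.push(state=t, action=0, reward=0.0, next_state=t + 1)   # dummy transitions
batch = buf.sample(batch_size=16)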
2.1.4 Optimizers

Many problems in the fields of science and engineering can be viewed as the optimization of some parameterized objective function, requiring maximization or minimization with respect to its parameters. A function with differentiable parameters is often optimized using gradient descent, as "the computation of first-order partial derivatives w.r.t. all the parameters is of the same computational complexity as just evaluating the function" [42]. Therefore, stochastic gradient-based optimization is of high importance in many areas [42]. Many objective functions are composed of subfunctions evaluated with different data, and in these cases optimization can be performed by stochastic gradient descent (SGD). Instead of computing the exact gradient, each iteration estimates the gradient based on a single randomly chosen example [42, 43].

Another method for stochastic optimization is Adam, an adaptive optimization algorithm requiring only first-order gradients with low memory usage [42, 44]. It was largely developed to handle the difficulty of tuning the learning rate hyperparameter of SGD, stemming from the widely varying magnitudes of parameters and their adjustment requirements during training [44]. Adam adapts the learning rates of different parameters automatically, based on estimates of the first and second moments of the gradients. Although often converging faster and simplifying the tuning of hyperparameters, Adam tends to generalize significantly worse than SGD in certain scenarios [42, 44].

2.2 Molecular Encoding

2.2.1 SMILES

In order to formalize chemistry, an unambiguous and reproducible notation is needed for naming chemical compounds. Such systems were swiftly established only years after the birth of structural chemistry in 1861 [45, 46]. Molecular configurations were identified in text by linearly specifying symbols for molecular segments and how they are connected [46]. However, with computers and chemical knowledge now at a point where enormous amounts of chemical information can be stored and used, a more efficient system is needed for providing relevant information. One of these systems is the Simplified Molecular-Input Line-Entry System (SMILES) [45, 47]. The SMILES format is a specification that represents the 2D chemical structures of molecules in the form of text strings specifically designed for computer use [45, 48]. Figure 2.3 showcases an example of how chemical structure relates to SMILES strings.

Figure 2.3: We illustrate different ways to represent the chemical structure of toluene (C7H8), starting with a ball-and-stick model. This model is simplified to a detailed skeleton structural formula, and then simplified further to a simplified structural formula. Lastly, it is transformed into seven SMILES enumerations, where all strings represent the same molecule.

This example illuminates a potential issue with the SMILES notation: a single molecule can be represented with multiple different SMILES strings. Toluene (Figure 2.3), with seven carbon atoms, has seven possible SMILES enumerations. Efforts made to consolidate how SMILES are generated resulted in the definition of a canonical SMILES, guaranteeing that a molecule corresponds to a single SMILES string [48, 49]. In Figure 2.3, the top SMILES string (Cc1ccccc1) is the canonical one. From this point onwards, SMILES and canonical SMILES will be used interchangeably unless otherwise stated.
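As a small illustration of canonicalization, the snippet below maps several SMILES enumerations of toluene to a single canonical string. It assumes the open-source RDKit library, which is not discussed in this thesis but is a common choice for this kind of manipulation.

from rdkit import Chem

# Different SMILES enumerations of toluene; all describe the same molecule.
enumerations = ["Cc1ccccc1", "c1ccccc1C", "c1ccc(C)cc1"]

# MolToSmiles produces a canonical SMILES by default, so every enumeration
# collapses to the same string.
canonical = {Chem.MolToSmiles(Chem.MolFromSmiles(s)) for s in enumerations}
print(canonical)   # a single canonical SMILES for toluene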
2.2.2 Molecular Fingerprints

One of the most common molecular abstractions is the molecular fingerprint, where a molecule is converted into a bit vector. This allows for easy comparisons [50]. Fingerprints have been used for a long time in drug discovery, as the representation of molecules in the form of mathematical objects allows for their application in both statistical analysis and machine learning. Their ease of use and the speed at which searches can be performed using them add to their popularity [51]. There are several types of molecular fingerprints, each depending on how a molecule is converted into a bit string. Most methods use the 2D molecular graph (as seen in Figure 2.4) and are therefore called 2D fingerprints. These are the types of molecular fingerprints covered in this section.

Figure 2.4: Representation of a 10-bit substructure fingerprint. The three bits set represent the substructures present in the molecule (circled) [51].

Computational advances have allowed for virtual screening, an in-silico method of searching large databases for bioactive molecules that largely reduces the number of compounds a researcher has to test experimentally [52]. A common approach for screening databases is the ligand-based method, retrieving compounds similar in some way to a single known bioactive molecule used as a starting point [53]. Examining the similarity between two known, complex compounds can be quite challenging. To facilitate this process, some simplification or abstraction is often necessary [51].

2.2.3 Fingerprint Types

Substructure keys-based fingerprints set bits in the bit vector depending on the presence or absence of substructures or specified structural keys in a molecule. This approach can be seen in Figure 2.4, where the bit vector has three set bits as the substructures or structural keys they represent are present in the molecule. This approach is highly useful when molecules are likely to be covered by some known keys [51].

Topological fingerprints instead work by examining all fragments of the molecule, following paths through it up to a specified number of bonds. The possible paths are then hashed to create the fingerprint, which allows any molecule to be meaningfully simplified into a fingerprint [51, 54].

Circular fingerprints are a type of topological hashed fingerprint, but instead of examining molecular paths, each atom's environment is examined up to a certain radius. Circular fingerprints are not suitable for substructure searching as they do not capture explicit connectivity. Instead, they are used for full-structure similarity queries [51, 54, 55]. The standard circular fingerprint version is Extended-Connectivity Fingerprints (ECFPs), which are based on the Morgan algorithm [56]. ECFPs represent "circular atom neighborhoods", producing fingerprints of varying length with a commonly used diameter of 4 or 6, referred to as ECFP4 and ECFP6, respectively [51, 54].

2.3 Molecular Novelty

The purpose of our automated framework is to suggest potential molecules for drug development, and it is therefore important to create a diverse set of candidate compounds. To achieve this, the framework has to ensure suggested compounds differ somewhat from those suggested in previous cycles. In this section, molecular novelty will be defined using the language of set theory [57]. A molecule will be considered novel if it is dissimilar enough to the previously generated molecules. A commonly used metric for determining the similarity between sets is the Jaccard index, J, defined as

J(A, B) = |A ∩ B| / |A ∪ B|.

However, as molecular novelty will be based on set dissimilarity rather than similarity, it will instead be computed using the Jaccard distance, J_δ, a dissimilarity measure between data sets derived from the Jaccard index [58], defined as

J_δ(A, B) = 1 − J(A, B) = (|A ∪ B| − |A ∩ B|) / |A ∪ B|.

The molecular novelty of each molecule in a set is defined as its average Jaccard distance to all previously created molecules, with the denominator being the total number of created molecules. This is defined by

novelty(A, B) = J_δ(A, B) / size(B),

where A is the set of compounds to be evaluated and B is the entire set of created molecules. This will return a list where each element corresponds to the average Jaccard distance of a molecule in A to all molecules in set B.
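The novelty definition above can be sketched directly in code by treating each fingerprint as the set of indices of its set bits. The snippet below is an illustration of the definition under that assumption, not the framework's implementation; in particular, the treatment of an empty history is our own choice.

from typing import List, Set

def jaccard_distance(a: Set[int], b: Set[int]) -> float:
    union = len(a | b)
    return 0.0 if union == 0 else (union - len(a & b)) / union

def novelty(candidates: List[Set[int]], history: List[Set[int]]) -> List[float]:
    """Average Jaccard distance of each candidate to all previously created molecules."""
    if not history:
        return [1.0] * len(candidates)   # assumption: everything is novel on the first cycle
    return [sum(jaccard_distance(c, h) for h in history) / len(history) for c in candidates]

# Each set holds the indices of the "on" bits of a fingerprint.
candidates = [{1, 5, 9}, {1, 5, 7}]
history = [{1, 5, 9}, {2, 3, 9}]
print(novelty(candidates, history))   # a candidate identical to a previous molecule scores lower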
2.4 QSAR

Quantitative structure-activity relationship (QSAR) modelling is a mathematical approach for describing relationships between structural properties and biological activities. Generally, variations in structural properties result in differing biological activities, and these models can therefore be used to predict the physicochemical, biological, and environmental properties of compounds from knowledge of their chemical structure [59]. QSAR models are useful for in-silico drug discovery as they provide a mechanism for prioritizing large quantities of chemicals based on their biological activities, thus alleviating the need for manual testing [59].

QSAR models can be either regression or classification models. A regression model relates a predictor value to the potency of a response variable, and a classification model relates a predictor value to a categorical value of the response variable. A predictor is a physicochemical property of a compound, and the QSAR response variable is some biological activity of the chemicals [60].

2.5 REINVENT

The design-step of the DMTA-cycle is traditionally performed by chemists manually designing and selecting interesting compounds (Section 1.2.1). In order to create an automated in-silico version of the DMTA-cycle, molecules have to be generated automatically. REINVENT is a de novo design tool that has shown promising performance in this area [15, 61].

2.5.1 Overview

REINVENT 3.0 is an open-source AI tool for de novo drug design and will be used as our generative model for molecule creation [16]. REINVENT is trained on datasets derived from ChEMBL, a manually curated database of molecules with drug-like properties [62], and is capable of generating compounds in the SMILES format. It is trained by "randomizing" the SMILES representation of the input data. This randomization of the compounds' representations uses multiple SMILES encodings for the same compound to ensure that the model learns the grammar rather than memorizing specific strings or substrings. The resulting model thus shows a significantly improved generalization potential and achieves a validity above 99% for the produced SMILES strings. The generative model can be fed with initial information, influencing the type of molecules it generates and thus producing the best molecules based on these criteria.

Most de novo drug design tools have three main components: a search space (SS), a search algorithm, and a search objective. The generative model, or search space, typically relies on either distribution learning or goal-oriented generation. Distribution learning mostly focuses on generating ideas resembling a particular set of molecules, while goal-oriented generation typically uses search algorithms to suggest molecules that satisfy the specific requirements without having to traverse the entire search space. Both cases make use of a scoring function (the search objective) to filter results [63]. REINVENT covers both distribution-learning and goal-oriented approaches. A generative model is used as the search space in the goal-directed approach, with reinforcement learning used as the search algorithm, and a flexible scoring function is used to form rewards.
2.5.2 Diversity Filter

The scores mentioned in Section 2.5.1 can also be regulated by a diversity filter (DF) that punishes redundancy and encourages exploration by rewarding solution diversity. As explained in Section 2.3, novelty is highly important for the framework to perform its task effectively, and REINVENT can aid in achieving molecular novelty by providing access to what are called diversity filters. "DF can be regarded as a collection of buckets used for keeping track of all generated scaffolds and the compounds that share those scaffolds" [16]. These filters allow REINVENT to generate solutions that are different from a set of reference values, letting it find solutions that differ from previous ones and thereby achieving exploration of the search space [16, 64].

2.5.3 Scoring Functions

Scoring functions are used by REINVENT to guide compound generation towards the desired area of chemical space. They can be tuned by the user to accommodate any given drug discovery project. The user defines which molecular components are of interest, and these are then combined into a composite scoring function, with each component corresponding to a single target property. The feedback from the scoring function is then used in REINVENT's reinforcement learning loop [64].

2.5.4 Reinforcement Learning in REINVENT

Reinforcement learning (RL, Section 2.1) is used to direct the generative model towards areas of the chemical space containing interesting compounds. The user can define a set of requirements reflecting the most important features of the desired compounds, and the reinforcement learning scenario is used to maximize the outcome of a scoring function with relevant parameters [16].

Figure 2.5: The REINVENT reinforcement learning loop [16].

A reinforcement learning structure consists of an actor and an environment. The actor takes an action and receives a reward indicating the quality of the action taken [25]. In REINVENT, actions are the steps taken to build token sequences, which translate to SMILES strings. The environment is shown as the Score Modulating Block in Figure 2.5. The prior used in the environment is a generative model, with the same architecture and vocabulary as the agent, and it samples compounds from a vast area of the chemical space. It acts as a reference point for the likelihood of sampling any given SMILES. The last component in the RL loop is inception, which keeps track of compounds that have scored well and randomly exposes some of these to the agent to help guide its learning [16].

3 Methods

3.1 DMTA-cycle in-silico: Framework

The developed framework is an in-silico simulation of the DMTA-cycle. The DMTA-cycle consists of four main steps: the design, make, test and analyze steps. There is also a sub-step of interest, the selection step. Figure 3.1 is a visualization of the DMTA-cycle with the selection sub-step included.

Figure 3.1: A visualization of the Design-Make-Test-Analyze (DMTA) cycle with the selection sub-step included.

3.1.1 Design

Molecular design is traditionally performed by chemists manually designing compounds that seem to be of interest (Section 1.2.1). However, as we aim to automate the drug design process, a generative model will be used to imitate this step. The generative model of choice is REINVENT (Section 2.5), and it will be run for each simulated DMTA-cycle to create a set of molecules.
In order to control how many molecules are made available to the next stage of the simulated DMTA-cycle, an integer variable called n_design is available to the user, which the framework passes to REINVENT to regulate how many of the generated molecules to present [16]. REINVENT samples molecules in reinforcement learning epochs, 128 per epoch. The user can specify a variable called n_steps to determine the minimum number of epochs to run [16]. For all following runs, n_steps is set to 500.

Molecular diversity is important when trying to produce useful molecules (Section 2.3), and REINVENT has a boolean setting called Diversity Filter (Section 2.5.2) that makes it prioritize diversification of the generated molecules [16]. This option is activated in every run. The settings used by REINVENT to generate molecules in the different runs are presented in Table 3.1.

Setting            Run 1    Run 2
n_design           1000     10000
n_steps            500      500
Diversity filter   True     True

Table 3.1: REINVENT settings used in different framework evaluation runs.

3.1.2 Selection

The purpose of the selection process is to choose the 100 most promising compounds from the large pool of compounds created by the design-step, to be used in the subsequent DMTA-cycle stages. It is important to note that the same compound cannot be selected multiple times. The hope is that a reduced number of compounds advancing from the design-step to the make-, test-, and analyze-steps will speed up the DMTA-cycle (Section 1.3).

The selection is done using deep reinforcement learning (Section 2.1). Deep RL was chosen as the design-step is capable of generating very large numbers of molecules, and storing these in a Q-table, as would have been the case with tabular Q-learning, is unfeasible. Instead, a deep Q-network (DQN) is used (Section 2.1.3). As DQNs can suffer from overestimation of action values in some environments, both DQN and double DQN (DDQN), introduced in Section 2.1.3, will be tested and evaluated. Details on the reinforcement learning implementation used for selection are available in Section 3.2.3.

3.1.3 Make

The make-step of the DMTA-cycle is the stage where molecules are normally synthesized in a laboratory (Section 1.2.2). This step is fully simulated in the framework, which provides a setting where the user can specify the probability that any molecule is successfully made. This value will be set to 1 for all runs. This will cause the make-step to have a 100% success rate, which is useful as we do not want to accidentally lose data that could have been useful for improving the selection algorithm and the generative model. However, a 100% molecule creation success rate is not realistic, and this could therefore be altered to better simulate real life, where not every molecule may be possible to create.

Further, it could have been interesting to impose restrictions based on molecular size, as large molecules might be difficult to create in a laboratory. The production process of chemical drugs is fairly well defined and streamlined, partly due to the larger number of competitors in the field, but also because of the often low number of ingredients required, which allows them to be created in large quantities [65]. Biologics (Section 1.2.2), however, are much more complicated to produce, and production tends to yield small quantities. They are also very sensitive to physical conditions such as temperature and light [66].
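As a small illustration of the simulated make-step described above, the sketch below filters the selected molecules through a configurable success probability. The function and parameter names are hypothetical; with the probability set to 1, as in all runs here, every selected molecule is kept.

import random
from typing import List

# Sketch of the simulated make-step: each selected molecule is "synthesized" with a
# configurable success probability. The parameter name is illustrative only.
def simulated_make(selected: List[str], make_success_probability: float = 1.0) -> List[str]:
    return [smiles for smiles in selected if random.random() < make_success_probability]

selected = ["Cc1ccccc1", "CCO", "c1ccncc1"]
made = simulated_make(selected, make_success_probability=1.0)   # all three molecules are kept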
3.1.4 Test

The selection algorithm will select molecules based on the scores assigned in the test-step. Two different ground-truths are used for scoring molecules, to gain a better understanding of how the reinforcement learning models interpret the environment and whether they behave differently under different scoring regimes. In one of the runs, the ground-truth used for scoring the selected molecules is based on both efficacy and novelty, while the other only uses efficacy. In both cases, binary scoring is used, with scores being either 1.0 or 0.0. First, the molecules are evaluated based on their novelty (in cases where novelty is used as a metric), with each molecule being scored based on its average Jaccard distance to all previously created molecules, as seen in Equation (3.1) and explained in Section 2.3:

novelty(A, B) = [(|A ∪ B| − |A ∩ B|) / |A ∪ B|] / size(B)    (3.1)

Any molecule with a novelty score > 0.3 is evaluated for its efficacy using Equations (3.2), (3.3) and (3.4) below (cases not considering novelty skip straight to this step):

5.5 ≤ carbon/oxygen ≤ 5.67    (3.2)
7 ≤ carbon/nitrogen ≤ 7.39    (3.3)
1.18 ≤ oxygen/nitrogen ≤ 1.34    (3.4)

These determine whether the molecular composition is deemed promising for the given task by looking at the ratios of carbon, nitrogen and oxygen atoms. If the compound has non-zero counts of at least two of these atoms, and the ratios of the non-zero counts satisfy the corresponding equations above, the compound is deemed active and given a score of 1.0. Otherwise, it is given a score of 0.0. After computing the scores, each molecule's score, along with its corresponding SMILES and fingerprint, is sent to the analyzer.

3.1.5 Analyze

The analyzer receives molecule scores with corresponding SMILES strings and fingerprints from the test-step. These are used to evaluate the selected molecules' performance and dictate how the framework should be updated for the next iteration. Molecule performance is based on a comparison between the assigned scores, as described in Section 3.1.4, and a target score set by the user. As the test-step uses binary scoring for molecules, the target score will be set to 1 for all runs. The framework also provides a setting which controls whether it stops after the molecules reach the threshold value. This setting will be turned off for all runs, as we wish to continue training until reaching the desired number of DMTA-cycles (500). Compounds reaching the threshold are then stored to be used for updating the design-step, giving REINVENT a better understanding of what constitutes a useful compound and hopefully allowing it to generate more effective ones in subsequent DMTA-cycles.

The next step is to update the QSAR model used by REINVENT, which is based on scikit-learn, a machine learning library for Python, to include the newly tested compounds [67–69]. This is done using the fingerprints and test scores of the molecules selected and made. All previously tested compounds' fingerprints are concatenated with the ones collected over the latest DMTA-cycle simulation, and the same procedure is performed for the test scores. The QSAR model is then re-fit using all concatenated data, as sketched below.
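A minimal sketch of this QSAR update, under the assumption that the model is a standard scikit-learn classifier fitted on binary fingerprints and binary scores, could look as follows. The choice of classifier, the array shapes and the variable names are illustrative assumptions; the thesis does not specify the exact estimator.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Sketch of the analyze-step's QSAR update: concatenate the new fingerprints and
# binary test scores with the history, then re-fit on everything seen so far.
rng = np.random.default_rng(0)

history_fps = rng.integers(0, 2, size=(200, 2048))    # previously tested fingerprints (dummy data)
history_scores = rng.integers(0, 2, size=200)          # previous binary test scores
new_fps = rng.integers(0, 2, size=(100, 2048))         # fingerprints from the latest cycle
new_scores = rng.integers(0, 2, size=100)              # scores from the latest cycle

X = np.concatenate([history_fps, new_fps], axis=0)
y = np.concatenate([history_scores, new_scores], axis=0)

qsar_model = RandomForestClassifier(n_estimators=100)
qsar_model.fit(X, y)                                    # re-fit on all concatenated data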
3.2 Reinforcement Learning

3.2.1 General

The reinforcement learning algorithm uses two neural network models with identical structure, both implemented with PyTorch [70]: a primary_model, Q, and a secondary_model, Q′. Using two models allows decoupling of action selection from target Q-value generation. The DQN algorithm uses only Q, while the DDQN algorithm uses both Q and Q′. For each model, the input layer has 2048 input features and 1024 output features. The first hidden layer has 1024 input features and 512 output features, and the second hidden layer has 512 input features and 256 output features. The output layer has 256 input features and 1 output feature. The input layer has 2048 input features because that is the number of bits used to represent a molecule's fingerprint, and the output layer has 1 output feature, which represents a molecule's Q-value.

3.2.2 State Representation

The state for each DMTA-cycle iteration is represented by the molecules most recently generated by REINVENT, each encoded as a 2048-bit fingerprint. In an effort to evaluate how the RL selection algorithm's performance depended on the number of input molecules, two different run setups were investigated. The design-step generated 1000 molecules in the first trial and 10000 in the second. The networks are fed with the entire state-space, consisting of either 1000 or 10000 molecules with 2048 features each. The RL network's input thus had the dimensions 1000 × 2048 in the first version and 10000 × 2048 in the second. The network then outputs the Q-values for each molecule, with the output vectors having the dimensions 1000 × 1 or 10000 × 1. The difference in selection performance between these settings is covered in Chapter 4. To ensure the RL networks received the same number of input molecules for each DMTA-cycle iteration, mock data consisting entirely of zeros was used as a complement in cases where REINVENT had created fewer molecules than desired for the current setup.

3.2.3 Explore/Exploit Function

As there is a considerable number of molecules to investigate, it is important that our selection algorithms thoroughly explore the different options. To balance exploration and exploitation, the decaying ε-greedy algorithm (Section 2.1.1) was used with a decay rate of 0.99 and a minimum exploration rate of 0.02. The decay function is shown in Equation 3.5.

ε_{t+1} = 0.99 · ε_t if ε_t ≥ 0.02, and ε_{t+1} = 0.02 otherwise    (3.5)

A random selection is made with probability ε, and the algorithm will exploit by selecting the compound with the highest Q-value with probability 1 − ε. This process is repeated until 100 compounds have been selected. The same compound cannot be selected multiple times.

RL settings
Setting          Value
learning rate    0.00025
gamma            0.90
epsilon max      1.0
epsilon min      0.02
epsilon decay    0.99
batch size       16
optimizer        Adam
loss function    MSE

Table 3.2: Settings used for RL in the selection step.
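As an illustration of the network and decay function just described, the following PyTorch sketch mirrors the layer sizes from Section 3.2.1 and Equation 3.5. The class name, the choice of ReLU activations, and the way the two models are constructed are our own assumptions; the framework's actual code may differ.

import torch.nn as nn

class QNetwork(nn.Module):
    # MLP mapping a 2048-bit fingerprint to a single Q-value,
    # using the layer sizes given in Section 3.2.1. The ReLU
    # activations are an assumption; they are not specified in the text.
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(2048, 1024), nn.ReLU(),
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, fingerprints):
        # fingerprints: tensor of shape (n_molecules, 2048)
        # returns: tensor of shape (n_molecules, 1), one Q-value per molecule
        return self.layers(fingerprints)

def decay_epsilon(epsilon, decay=0.99, minimum=0.02):
    # Equation 3.5: multiplicative decay with a floor at the minimum exploration rate.
    return epsilon * decay if epsilon >= minimum else minimum

# Two networks with identical structure (Section 3.2.1):
primary_model = QNetwork()       # Q, used by both DQN and DDQN
secondary_model = QNetwork()     # Q', the additional network used by DDQN
secondary_model.load_state_dict(primary_model.state_dict())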
3.2.4 Reward Setup

Two different approaches are used for rewards, as outlined in Section 3.1.4 [25]. In the first, the reward is based entirely on whether the selected molecule lies within the thresholds specified by Equations 3.2, 3.3 and 3.4. If the chemical composition of a compound satisfies these constraints, it is given the score 1.0. Otherwise it is given the score 0.0. In the second approach, molecules are also scored based on their novelty, with novelty computed as per Equation 3.1. If a molecule has an average Jaccard distance to all other molecules of ≥ 0.3, it is deemed novel and is subsequently subjected to the same test as in the first approach. If its average Jaccard distance is < 0.3, it is deemed not novel enough and given the score 0.0.

3.2.5 Molecule Selection

The network models are trained using the fingerprints of the molecules generated in the design-step. These are presented to the model, which outputs a Q-value for each fingerprint. These Q-values are either explored or exploited according to Equation 3.6, where R is a random threshold value between 0 and 1 and ε_t is the exploration rate at time t. If the Q-values are explored, one is chosen at random and the molecule corresponding to this value is stored.

selected Q-value = random Q-value if R < ε_t, highest available Q-value otherwise    (3.6)

If the Q-values are instead exploited, the highest available Q-value is selected (and subsequently marked as ineligible to prevent it being selected multiple times), and its corresponding molecule is stored. This process is repeated until 100 molecules have been selected, after which all selected molecules are stored in the experience replay buffer to be used for training. The parameter values used for molecule selection and in experience replay are shown in Table 3.2.

3.2.6 Experience Replay

The analyzer calls on the selector to update when new test scores are available, and the new information is added to the replay buffer. Next, the algorithm selects 16 samples from the buffer to be used for training the RL model with either DQN or DDQN. The DQN and DDQN update functions are shown in Equations 3.7 and 3.8, respectively. The target model Q′, used with DDQN, is updated every third cycle.

Q*(s_t, a_t) ≈ r_t + γ · Q(s_{t+1}, argmax_{a′} Q(s_{t+1}, a′))    (3.7)

Q*(s_t, a_t) ≈ r_t + γ · Q(s_{t+1}, argmax_{a′} Q′(s_{t+1}, a′))    (3.8)

When the models have been updated, the loss is calculated using the mean squared error (MSE). The MSE is the mean of the squared errors between the observed values Y and the predicted values Ŷ, as defined in Equation 3.9.

MSE = (1/n) · Σ_{i=1}^{n} (Y_i − Ŷ_i)²    (3.9)

Next, the gradients are computed and the error is backpropagated using the Adam optimizer, defined in Section 2.1.4. The final step is updating the exploration rate using Equation 3.5. This iteration of the DMTA-cycle is then complete, and unless we have just completed cycle 500, the process starts again from the design-step and incorporates the newly acquired knowledge.
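The sketch below ties Sections 3.2.5 and 3.2.6 together. It is our own simplified illustration rather than the framework's code: the function names are hypothetical, each replay-buffer entry is assumed to be a (fingerprint, reward, next-cycle fingerprint pool) tuple, and the targets follow the forms of Equations 3.7 and 3.8 as written above.

import random
import torch
import torch.nn.functional as F

def select_molecules(q_values, epsilon, n_select=100):
    # Epsilon-greedy selection without replacement (Equation 3.6);
    # q_values is a 1-D tensor with one Q-value per candidate molecule.
    available = list(range(len(q_values)))
    selected = []
    for _ in range(n_select):
        if random.random() < epsilon:                              # explore
            idx = random.choice(available)
        else:                                                      # exploit
            idx = max(available, key=lambda i: q_values[i].item())
        selected.append(idx)
        available.remove(idx)   # the same compound cannot be selected twice
    return selected

def td_target(primary, target, reward, next_pool, gamma=0.90, double=False):
    # Target value for one transition; next_pool holds the fingerprints
    # (n_molecules x 2048) presented at the next cycle.
    with torch.no_grad():
        q_primary = primary(next_pool).squeeze(-1)
        if double:   # Equation 3.8: action picked with Q', value taken from Q
            action = target(next_pool).squeeze(-1).argmax()
        else:        # Equation 3.7: action picked and evaluated with Q
            action = q_primary.argmax()
        return reward + gamma * q_primary[action]

def replay_update(primary, target, optimizer, replay_buffer,
                  batch_size=16, gamma=0.90, double=False):
    # Sample 16 transitions, compute the DQN/DDQN targets, and minimise
    # the MSE loss (Equation 3.9) with the chosen optimizer (Table 3.2).
    batch = random.sample(replay_buffer, batch_size)
    predictions = torch.stack([primary(fp).squeeze() for fp, _, _ in batch])
    targets = torch.stack([td_target(primary, target, r, nxt, gamma, double)
                           for _, r, nxt in batch])
    loss = F.mse_loss(predictions, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

With the settings in Table 3.2, the optimizer would be torch.optim.Adam(primary.parameters(), lr=0.00025).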
4 Results

4.1 Settings

All of the results in the following sections were created with REINVENT setups as per Table 3.1 in Section 3.1.1. That table covers the REINVENT settings for the different runs we produced: whether we run with novelty off or on, and the number of compounds presented to each selection process. With novelty on, the novelty calculation using the Jaccard distance described in Section 3.1.4 is utilized. With novelty off, there is no novelty calculation and compounds are not punished for being similar. As these settings vary across runs, we can expect them to have a significant impact on the results presented in the following sections. Table 3.1 also shows that we have the Diversity Filter enabled for all runs, which should noticeably improve the average scores for runs using novelty scoring. Specifics on how the Diversity Filter works are found in Sections 2.5.2 and 3.1.1. All the results from the DQN and DDQN selectors also use the RL settings for the selection-step presented in Table 3.2 in Section 3.2.3.

Table 3.2 covers the variables used by the selection and update steps (Equations 3.5-3.8), namely the learning rate (α), discount factor (γ), epsilon min (ε_min), epsilon max (ε_max) and epsilon decay (ε_decay). The batch size determines the number of samples the network uses for training at each step. The optimizer refers to the optimization technique for gradient descent, and the loss function defines which technique is used for calculating the loss. How much these settings affect the results varies; the learning rate and discount factor should have a noticeable impact, since they enter multiple calculations each cycle. The batch size will likely not have a large effect on the results, as it mostly affects the speed and stability of the selector updates by determining how much data is used for training at the same time. The optimizer and loss function could have a bigger effect, but since they are kept the same across DQN and DDQN runs, it seems unlikely that they would cause large differences in selector performance. To summarize, we do not believe the RL settings in Table 3.2 will have a big effect on the results, largely because all DQN and DDQN runs use the same settings.

4.2 Baseline

In order to properly evaluate the performance of our selection algorithms, six baseline results were established using greedy and random molecule selection. In the greedy selection, the molecules scored most highly by REINVENT's own scoring function were selected in each cycle, while the random selection randomly picked molecules created by REINVENT.

4.2.1 Baseline Results

Figure 4.1 shows the greedy and random selectors where they each select from 1000 compounds with novelty on. Figure 4.2 shows the baseline selectors where they each select from 1000 compounds with novelty off. Finally, Figure 4.3 shows the two selectors where they each select from 10000 compounds with novelty on. All figures show the average results over 10 independent runs, each running for 500 cycles. All figures also use a moving average score over 10 cycles.

Figure 4.1: Plots of the greedy and the random selectors where they select from 1000 compounds with novelty on. The scores are computed using a moving average of 10 cycles.

Figure 4.2: Plots of the greedy and the random selectors where they select from 1000 compounds with novelty off. The scores are computed using a moving average of 10 cycles.

Figure 4.3: Plots of the greedy and the random selectors where they select from 10000 compounds and with novelty on. The scores are computed using a moving average of 10 cycles.

4.2.2 Baseline Observations

Comparing Figure 4.1 with Figure 4.2, we can see that the greedy selector appears to provide slightly better scores with novelty on, and that the random selector appears to perform slightly better with novelty off. The random selector appears to converge to its maximum score faster than greedy across all three setups, with the greatest difference compared to the greedy selector occurring with novelty turned on and 1000 compounds used (Figure 4.1).
Stability-wise, both baseline selectors appear quite steady across all three configurations, and both appear to be the most stable when using 10000 compounds (Figure 4.3); notably, in that setting the random selector barely deviates from a score of 1.0 at all. It is worth noting that the random selector is slightly unstable with novelty on and 1000 compounds used: it reaches a score of 1.0 at quite a few points but oscillates around 0.90-0.95 with a fairly high frequency. None of the other five results across the three figures fluctuate with such a high frequency. They all have the occasional dip in score, but nothing noteworthy. In terms of speed, all the random selector configurations appear to reach a score of around 0.9 within less than 50 cycles, with the novelty-off version (Figure 4.2) and the 10000-compound version (Figure 4.3) settling around a score of 1.0 in the same number of cycles. The greedy selector is slightly slower, reaching a score of about 0.9 within less than 60 cycles with novelty on (Figure 4.1), and settling around a score of 1.0 within the same number of cycles with novelty off (Figure 4.2).

4.3 Deep Q-Learning Selectors

Two selection algorithms were implemented based on the theory described in Section 2.1.3. The first is a Deep Q-Network (DQN), an RL algorithm that utilizes a neural network to approximate the Q-value function. The second is a Double Deep Q-Network (DDQN) selection algorithm, which works similarly to the DQN selector but also uses a second neural network to better approximate the Q-value function.

4.3.1 Deep Q-Learning Selectors Results

Similarly to Section 4.2.1, all these results use the setup shown in Section 3.2.3. Figure 4.4 shows the results of the DQN and DDQN selectors where they each select from 1000 compounds with novelty on. Figure 4.5 shows the results of the two implemented selectors where they select from 1000 compounds with novelty off. Lastly, Figure 4.6 shows the two selectors where they select from 10000 molecules with novelty on. Similarly to Section 4.2.1, the figures all show the average results over 10 independent runs, each running for 500 cycles. All figures also use a moving average of 10 cycles.

Figure 4.4: Plots of the DQN and the DDQN selectors where they select from 1000 compounds with novelty on. The scores are computed using a moving average of 10 cycles.

Figure 4.5: Plots of the DQN and the DDQN selectors where they select from 1000 compounds with novelty off. The scores are computed using a moving average of 10 cycles.

Figure 4.6: Plots of the DQN and the DDQN selectors where they select from 10000 compounds and with novelty on. The scores are computed using a moving average of 10 cycles.

4.3.2 Deep Q-Learning Selectors Discussions

Both selectors start off well and rise quickly over the initial cycles, although this is most likely the diversity filter's work, since it filters out all SMILES below a user-set threshold [16]. The DQN selector with novelty on (Figure 4.4) reaches a score of roughly 0.95 after about 60 cycles. After this, it rises slightly to around 1.0 and then hovers around this score. The DDQN selector with novelty on (Figure 4.4) rises faster and reaches 1.0 after about 50 cycles.
Both selectors fluctuate a fair amount afterwards, particularly later on for the DQN selector. The fluctuations for the DDQN selector appear both earlier and with larger swings than those of the DQN selector. The averaged results in Figure 4.4 indicate that the DQN selector is slightly more stable than the DDQN selector when presented with 1000 molecules each cycle.

Looking at Figure 4.5 in Section 4.3.1, we can see how the DQN and DDQN selectors compare to each other with novelty turned off. The DQN selector is slower to reach a score near 1.0 than with novelty on (Figure 4.4), taking around 90 cycles in this case. It also fluctuates a lot more and starts fluctuating earlier, even before the initial big growth in score. The DDQN selector with novelty off appears to be doing significantly worse than with novelty on (Figure 4.4), reaching a score around 1.0 at only two peaks, around 200 cycles and 400 cycles in. The DDQN selector is constantly fluctuating with this setup and does not appear to have any significant intervals where it is stable. Comparing the two selectors with novelty off, the DQN selector appears to be a clear winner. It is not as stable as with novelty on, but it does appear to be stable the majority of the time. The DDQN selector appears to be quite unstable with these settings. Based on these two figures (Figure 4.4 and Figure 4.5), the novelty calculations in the test-step appear to affect the average results more for the DDQN selector than for the DQN selector.

The final figure in Section 4.3.1 is Figure 4.6, which shows the results for the DQN and DDQN selectors with novelty on, but this time selecting from 10000 compounds each cycle, as opposed to the 1000 compounds in the previous two figures in the same section. The DQN selector appears to reach its maximum score at around the same speed as in the previous figures. The maximum score is however slightly lower, topping out around 0.95. It also appears to be more unstable than in the 1000-compound figures, constantly fluctuating with a high frequency and low amplitude. The DDQN selector appears to perform significantly worse when it selects from 10000 compounds, even compared to selecting from 1000 compounds with novelty off. It only reaches its maximum score (close to 1.0) once, after around 470 cycles. Before and after that it oscillates violently around 0.5, reaching 0.8 and then dipping down to around 0.2-0.3. The DQN selector oscillates around a higher score (0.9) and with a higher frequency than the DDQN selector. Both selectors appear to perform worse when using 10000 compounds, with the performance decrease appearing larger for the DDQN selector.

We can also look at and draw comparisons from a few selected runs from each selector, selecting from 1000 molecules with novelty on. The runs were handpicked to more clearly illustrate the difference in performance between DQN and DDQN. In Figure 4.7 we see three DQN selector runs from the selection of runs used to compose the DQN plot in Figure 4.4. Three DDQN selector runs used by the DDQN plot in Figure 4.4 can be found in Figure 4.8.

Figure 4.7: Three selected individual runs (version 1, 6, 8) using the DQN selector. The algorithm selected from 1000 molecules and used novelty scoring.
Figure 4.8: Three selected individual runs (version 2, 5, 9) using the DDQN selector. The algorithm selected from 1000 molecules and used novelty scoring.

The DQN selector runs appear to be fairly stable, with the version 6 plot from Figure 4.7 being a bit more unstable until it settles around cycle 100. The DDQN runs are quite volatile, with some experiencing extreme scenarios, as can be seen in the version 2 plot from Figure 4.8, where the DDQN selector quickly reaches a score of near 1.0 but after around 225 cycles starts producing scores near 0. This is likely the reason the composite DDQN figure (Figure 4.4) converges to roughly 0.8, while most other selection algorithms trend toward 0.9. This issue could potentially be explained by what is known as catastrophic forgetting, where a neural network forgets previously learned information upon acquisition of new information [71]. As the chemical space is enormous, it is likely that some molecules may not reoccur for a long time, and in these scenarios network information retention is important while still allowing for the acquisition of new knowledge. It might seem as though the experience replay buffer should prevent this. However, when the model has achieved a large number of successes, these experiences will make up the majority of the buffer and the algorithm will start forgetting what failure looks like. It will therefore struggle to predict values for these states. A way of maintaining proficiency on tasks not recently experienced was proposed by Kirkpatrick et al., where old tasks are remembered by slowing down the learning on weights important to those tasks [72]. Resolving this issue in a machine learning framework for drug discovery could yield great benefits, as the DDQN shows great promise, with runs potentially converging quickly to scores near 1.0, as can be seen in the runs plotted in Figure 4.8. A possible explanation for why DQN does not seem to experience catastrophic forgetting is that it uses the same network for calculating both the current and target values, whereas DDQN uses two different networks. With two networks, each susceptible to catastrophic forgetting, DDQN may suffer from an increased likelihood of encountering this issue. One of the networks is used for calculating the current values, while the other calculates the estimated target values. The difference between these two values is used to calculate the RL model's loss. If one of these networks were to "forget" what failure looks like, the loss would become large and the model would be unable to properly estimate how good a specific state is.

To summarize, the DQN selector scores well and is close to 1.0 in most cases, as can be seen in both the figures in Section 4.3.1 and the individual runs in Figure 4.7. The DDQN selector also scores well and is close to 1.0, although it does appear to have some instability issues. This is most notable in Figure 4.5 and Figure 4.6, as well as the version 9 run in Figure 4.8.

4.4 Comparisons and Discussions

We now present composite versions of the figures from Section 4.2.1 and Section 4.3.1. This is done to make it easier to draw comparisons between the two baseline selectors (greedy and random) and the two RL selectors (DQN and DDQN). Since these figures are composites, they all use the same settings as in Section 4.2.1 and Section 4.3.1.
Figure 4.9 shows the results of all the selectors where they each select from 1000 compounds with novelty on. Figure 4.10 shows the results of all the selectors where they each select from 1000 compounds with novelty off. Lastly, Figure 4.11 shows all the selectors where they each select from 10000 compounds with novelty on. Like the individual figures in Sections 4.2.1 and 4.3.1, all the figures show the average results over 10 independent runs, each running for 500 cycles. All figures also use a moving average of 10 cycles.

Figure 4.9: Plots of all four selectors (greedy, random, DQN, DDQN) where they select from 1000 compounds with novelty on. The scores are computed using a moving average of 10 cycles.

Figure 4.10: Plots of all four selectors (greedy, random, DQN, DDQN) where they select from 1000 compounds with novelty off. The scores are computed using a moving average of 10 cycles.

Figure 4.11: Plots of all four selectors (greedy, random, DQN, DDQN) where they select from 10000 compounds and with novelty on. The scores are computed using a moving average of 10 cycles.

Using these composite figures, we can now draw some comparisons between the baseline runs in Section 4.2.1 and the two deep Q-network selectors in Section 4.3.1. Starting with Figure 4.9, 1000 compounds with novelty on, the first observation is that the DQN selector appears to perform about as well as the greedy and random selectors in the same figure. It even appears to converge faster than the greedy selector. This is not entirely surprising, as the DQN selector utilizes a neural network that learns which molecules score well rather than just picking the ones REINVENT scores highly, as the greedy selector does [73]. What is surprising is how well the random selector performs. One could expect that randomly picking the molecules created by REINVENT would result in a more random, and therefore lower, score than what the greedy and DQN selectors can provide, as a random selection would at times choose very poorly performing molecules without learning from the negative experience, but this does not appear to be the case. The number of runs used for our results, i.e., 10 runs, could be the culprit. We are a bit wary of pointing this out specifically, though, as more runs are available for a few selected series, the DQN selector in Figure 4.9 among them, and we struggle to see a noticeable difference when the additional runs are included.

Looking at the DDQN selector's result in Figure 4.9 and comparing it to the baseline selectors in the same figure, it leaves more to be desired. We believe that the apparent instability of the DDQN selector is the culprit here. If we look at the individual runs of the DDQN selector in Figure 4.8, we see that the DDQN selector is quick to converge to a score approaching 1.0. However, we can also see the instability over the moving average in Figure 4.8, particularly in the version 2 plot, but also somewhat in the version 5 plot.
We believe that this instability is what causes the plot of the average DDQN selector scores in Figure 4.9 to be significantly worse than those of the baseline and DQN selectors in the same figure.

Next, we look at Figure 4.10, with 1000 compounds and novelty off. The situation for the DQN selector is largely similar to the previous case. It is a bit slower to reach its maximum score, taking around 100 cycles to do so. It is largely stable, with a few drops to around a score of 0.8, lasting around 20 cycles each until it once again settles around 1.0. The greedy selector exhibits behavior similar to the DQN selector, with roughly the same number of dips but only dropping to about 0.9. The random selector is once again the best performing selector, being both the fastest to reach the maximum score and the most stable.

Switching over to the DDQN results in Figure 4.10, we note that it appears to be even more unstable than in Figure 4.9. It only reaches a test score of 1.0 at two points, around cycles 200 and 430. The rest of the time it oscillates around a test score of roughly 0.6, never really having any stable periods. Our analysis was unable to explain why both the DQN and the DDQN selectors perform worse with novelty off, slightly so in the DQN selector's case and significantly so in the DDQN selector's case. Looking at the two baselines, the random selector performs even better with novelty off (Figure 4.10) than with novelty on (Figure 4.9), while the greedy selector performs at near the same level for both settings, with perhaps a slight reduction in performance when novelty is on. One thing we are considering is that the more stringent novelty calculations, i.e., Figure 4.9 with novelty on, force the selectors to make better choices. This favors the three selectors that do not pick completely at random, i.e., all but the random selector.

Looking at Figure 4.11, where the selectors picked from 10000 compounds with novelty on, the first observation is that the two baseline selectors' performances do not differ from the previous two figures (Figure 4.9 and Figure 4.10) to any significant degree. Looking at the DQN selector, it performs significantly worse than in the results where it selects from 1000 compounds, regardless of the novelty setting used. It never manages to reach a test score of 1.0 and instead oscillates around 0.9, never truly converging. The DDQN selector also performs worse when selecting from 10000 compounds compared to 1000 compounds, oscillating around a score of 0.5 with no stable periods and reaching a maximum score of 1.0 once, at around cycle 460. It behaves largely the same as in Figure 4.10 (selecting from 1000 compounds with novelty off), which is interesting because Figure 4.11 does have novelty on. If the opposite had been true, i.e., novelty off when using 10000 compounds, it would have been easy to point to that setting when considering why we see performance problems with the DQN and DDQN selectors and not with the baselines. If anything, based on our results, the baselines perform slightly better when selecting from more compounds rather than fewer.

5 Conclusions

The Diversity Filter makes REINVENT very good at finding novel compounds, with the most notable effect being how quickly all plots in the three figures in Section 4.4 rise toward their test score maximum [16]. The novelty setting appears to fill a similar role when it comes to the speed of the non-random selectors.
Comparing the number of cycles it takes for the DQN selector to reach its test score maximum with novelty off in Figure 4.10 with the same selector in Figure 4.9 with novelty on shows this effect. The novelty setting does not appear to have any significant effect on the baseline selectors; if anything, it appears to make the random selector perform slightly worse, as can be seen in Figure 4.9. One could argue about whether the implemented novelty setting provides any significant improvement over just utilizing the Diversity Filter. We would argue that it does, with the noted improvement of the DQN selector using the novelty setting suggesting this. The DDQN selector also performs significantly better with novelty turned on, which becomes apparent when comparing Figure 4.9 (novelty on) with Figure 4.10 (novelty off). As both the implemented DQN and DDQN selectors utilize one additional layer of RL in their selection method compared to the baselines, it appears that the novelty setting has a notable effect on speed if the selection method is advanced enough. So, the need for the novelty setting appears to be tied to the complexity of the selection method: if it is sufficiently complex, the diversity filter alone does not seem to offer enough of an improvement to the speed of the selector.

As we performed many runs where only 1000 of the molecules designed by REINVENT were used, random selection might be performing unreasonably well compared to the other selection algorithms, as a smaller pool of available molecules increases the likelihood of selecting good ones by chance. Therefore, it could be interesting to perform runs using all available molecules to gain a better understanding of how the different selection algorithms compare to one another in large state spaces. The runs using 10000 molecules, in Figure 4.11, provide a glimpse into what this might look like. Still, for 10000 molecules, random selection performs just about identically to greedy. One potential reason for this is that our ground truth, described in Section 3.1.4, is not complex enough. This would mean that the molecules produced by REINVENT are considered promising too easily, thus favoring a random selection more. A potential counterargument can be obtained by looking at Figure 4.11 again, but this time focusing on the other two plots, those of the DQN and DDQN selectors. Both the DQN and the DDQN perform notably worse when selecting from 10000 molecules compared to when selecting from 1000 molecules in Figure 4.9, with all other settings being identical across the two versions. Both of the implemented selectors perform better when selecting from 1000 molecules rather than 10000, with the DQN selector performing approximately as well as the greedy selector. A more complex ground truth could potentially mean that the scores REINVENT assigns to molecules are a bit less accurate, and could consequently make the non-random selectors less accurate. A larger selection of molecules could potentially show this effect. The results in Figure 4.11 compared to Figure 4.9 point towards this effect for the implemented RL selectors, but not for the greedy selector. Therefore, the hypothesis that our ground truth is too simple is left unanswered.

Both DQN and DDQN fail to achieve the results of greedy and random, with DDQN especially lacking. DQN is near identical in performance to the baselines in runs with 1000 molecules and novelty on (Figure 4.9), showing some signs of promise.
DDQN, however, reaches the same maximum scores as the others but is prone to massive drops in performance. It may be that REINVENT is so good at suggesting molecules that the learning performed by our RL selection algorithm interferes with REINVENT's own learning. As there are additional steps in DDQN compared to DQN and the baselines, REINVENT's own understanding of what constitutes a good molecule might get even more clouded. Figure 4.10 seems to support this idea. Neither DQN nor DDQN performs as well without novelty scoring as with it, with DDQN again being significantly worse than the rest. Without novelty scoring, our RL selection algorithm does not reward compounds for being novel. This might cause confusion for REINVENT, which uses the diversity filter but is not explicitly rewarded for novelty. This could potentially cause uncertainty as to what REINVENT should generate, leading to larger drops in performance than when novelty was rewarded. Both DQN and DDQN are particularly far from the baseline performances in runs with 10000 compounds (Figure 4.11). Here, even more molecules are run through the RL selection algorithm, further decreasing the purity of REINVENT's knowledge by polluting it with data passing through a different RL system. Random and greedy both converge to scores near 1 faster without novelty scoring, showing that REINVENT is fully capable of generating useful molecules quickly when the data is not adulterated by another RL system.

To summarize, neither of the implemented RL selectors (DQN and DDQN) performed well enough to compare with the baseline selectors in any of our test series. The DDQN version performed particularly poorly, and we suspect this could be due to data pollution caused by there being too many agents involved, clouding REINVENT's understanding of what constitutes a good molecule. The DQN selector, on the other hand, shows promise and could, with some polish, potentially be a viable alternative to the greedy and random selectors.

Bibliography

[1] Petra Schneider, W Patrick Walters, Alleyn T Plowright, Norman Sieroka, Jennifer Listgarten, Robert A Goodnow, Jasmin Fisher, Johanna M Jansen, José S Duca, Thomas S Rush, et al. Rethinking drug design in the artificial intelligence era. Nature Reviews Drug Discovery, 19(5):353–364, 2020.

[2] Amol Deore, Jayprabha Dhumane, Rushikesh Wagh, and Rushikesh Sonawane. The stages of drug discovery and development process. Asian Journal of Pharmaceutical Research and Development, 7(6):62–67, Dec. 2019.

[3] Petra Schneider, W. Patrick Walters, Alleyn T. Plowright, et al. Rethinking drug design in the artificial intelligence era. Nature Reviews Drug Discovery, 19(6):353–364, May 2020.

[4] Serge Mignani, Scot Huber, Helena Tomás, João Rodrigues, and Jean-Pierre Majoral. Why and how have drug discovery strategies in pharma changed? What are the new mindsets? Drug Discovery Today, 21(2):239–249, 2016.

[5] Kenneth I Kaitin. Deconstructing the drug development process: the new face of innovation. Clinical Pharmacology & Therapeutics, 87(3):356–361, 2010.

[6] Christopher P Adams and Van V Brantner. Estimating the cost of new drug development: is it really $802 million? Health Affairs, 25(2):420–428, 2006.

[7] Joshua Meyers, Benedek Fabian, and Nathan Brown. De novo molecular design and generative models. Drug Discovery Today, 26(11):2707–2715, 2021.

[8] Jean-Louis Reymond. The chemical space project. Accounts of Chemical Research, 48(3):722–730, 2015.
[9] Melodie Christensen, Lars PE Yunker, Folarin Adedeji, Florian Häse, Loïc M Roch, Tobias Gensch, Gabriel dos Passos Gomes, Tara Zepel, Matthew S Sigman, Alán Aspuru-Guzik, et al. Data-science driven autonomous process optimization. Communications Chemistry, 4(1):1–12, 2021.

[10] Allan M. Jordan. Artificial intelligence in drug design—the storm before the calm? ACS Medicinal Chemistry Letters, 9(12):1150–1152, 2018.

[11] Alleyn T Plowright, Craig Johnstone, Jan Kihlberg, Jonas Pettersson, Graeme Robb, and Richard A Thompson. Hypothesis driven drug design: improving quality and effectiveness of the design-make-test-analyse cycle. Drug Discovery Today, 17(1-2):56–62, 2012.

[12] Daniel Merk, Lukas Friedrich, Francesca Grisoni, and Gisbert Schneider. De novo design of bioactive small molecules by artificial intelligence. Molecular Informatics, 37(1-2):1700153, 2018.

[13] W Patrick Walters and Mark Murcko. Assessing the impact of generative AI on medicinal chemistry. Nature Biotechnology, 38(2):143–145, 2020.

[14] A. Ng and Michael I. Jordan. On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In NIPS, 2001.

[15] Wenhao Gao, Tianfan Fu, Jimeng Sun, and Connor W Coley. Sample efficiency matters: A benchmark for practical molecular optimization. arXiv preprint arXiv:2206.12411, 2022.

[16] Thomas Blaschke, Josep Arús-Pous, Hongming Chen, Christian Margreitter, Christian Tyrchan, Ola Engkvist, Kostas Papadopoulos, and Atanas Patronov. REINVENT 2.0: an AI tool for de novo drug design. Journal of Chemical Information and Modeling, 60(12):5918–5922, 2020.

[17] Steven M. Mennen, Carolina Alhambra, C. Liana Allen, Mario Barberis, Simon Berritt, Thomas A. Brandt, Andrew D. Campbell, Jesús Castañón, Alan H. Cherney, Melodie Christensen, David B. Damon, J. Eugenio de Diego, Susana García-Cerrada, Pablo García-Losada, Rubén Haro, Jacob Janey, David C. Leitch, Ling Li, Fangfang Liu, Paul C. Lobben, David W. C. MacMillan, Javier Magano, Emma McInturff, Sebastien Monfette, Ronald J. Post, Danielle Schultz, Barbara J. Sitter, Jason M. Stevens, Iulia I. Strambeanu, Jack Twilton, Ke Wang, and Matthew A. Zajac. The evolution of high-throughput experimentation in pharmaceutical development and perspectives on the future. Organic Process Research & Development, 23(6):1213–1242, 2019.

[18] Favour Danladi Makurvet. Biologics vs. small molecules: Drug costs and patient access. Medicine in Drug Discovery, 9:100075, 2021.

[19] Jungseog Kang, Chien-Hsiang Hsu, Qi Wu, Shanshan Liu, Adam D Coster, Bruce A Posner, Steven J Altschuler, and Lani F Wu. Improving drug discovery with high-content phenotypic screens by systematic selection of reporter cell lines. Nature Biotechnology, 34(1):70–77, 2016.

[20] JP Hughes, S Rees, SB Kalindjian, and KL Philpott. Principles of early drug discovery. British Journal of Pharmacology, 162(6):1239–1249, 2011.

[21] Independent High-Level Expert Group on Artificial Intelligence. Ethics guidelines for trustworthy AI. European Commission, 2019.

[22] Xiaochu Tong, Xiaohong Liu, Xiaoqin Tan, Xutong Li, Jiaxin Jiang, Zhaoping Xiong, Tingyang Xu, Hualiang Jiang, Nan Qiao, and Mingyue Zheng. Generative models for de novo drug design. Journal of Medicinal Chemistry, 64(19):14011–14027, 2021.

[23] Woosung Jeon and Dongsup Kim. Autonomous molecule generation using reinforcement learning and docking to develop potential novel inhibitors. Scientific Reports, 10(1):22104, December 2020.
[24] Jessica Vamathevan, Dominic Clark, Paul Czodrowski, Ian Dunham, Edgardo Ferran, George Lee, Bin Li, Anant Madabhushi, Parantu Shah, Michaela Spitzer, et al. Applications of machine learning in drug discovery and development. Nature Reviews Drug Discovery, 18(6):463–477, 2019.

[25] Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.

[26] Shweta Bhatt. Reinforcement learning 101, Learn the essentials of Reinforcement Learning!, 2018. https://towardsdatascience.com/reinforcement-learning-101-e24b50e1d292, Last accessed on 2022-02-21.

[27] Fabio Pardo, Arash Tavakoli, Vitaly Levdik, and Petar Kormushev. Time limits in reinforcement learning. In International Conference on Machine Learning, pages 4045–4054. PMLR, 2018.

[28] Quyet Nguyen, Noel Teku, and Tamal Bose. Epsilon greedy strategy for hyper parameters tuning of a neural network equalizer. In 2021 12th International Symposium on Image and Signal Processing and Analysis (ISPA), pages 209–212. IEEE, 2021.

[29] Aakash Maroti. RBED: Reward based epsilon decay. arXiv preprint arXiv:1910.13701, 2019.

[30] Ian Osband, Benjamin Van Roy, Daniel J Russo, Zheng Wen, et al. Deep exploration via randomized value functions. Journal of Machine Learning Research, 20(124):1–62, 2019.

[31] Dan Horgan, John Quan, David Budden, Gabriel Barth-Maron, Matteo Hessel, Hado Van Hasselt, and David Silver. Distributed prioritized experience replay. arXiv preprint arXiv:1803.00933, 2018.

[32] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

[33] Kai Arulkumaran, Marc Peter Deisenroth, Miles Brundage, and Anil Anthony Bharath. Deep reinforcement learning: A brief survey. IEEE Signal Processing Magazine, 34(6):26–38, 2017.

[34] Yuxi Li. Deep reinforcement learning: