MASTER’S THESIS 2024 De NovoMolecular Generation of Molecules with Consistent Synthetic Strategy ALBIN EKBORG DEPARTMENT OF PHYSICS CHALMERS UNIVERSITY OF TECHNOLOGY Gothenburg, Sweden 2024 De Novo Molecular Generation of Molecules with Consistent Synthetic Strategy ALBIN EKBORG © ALBIN EKBORG, 2024. Supervisor: Jon Paul Janet, AstraZeneca Supervisor: Alexey Voronov, AstraZeneca Examiner: Mats Granath, Department of Physics Master’s Thesis 2024 Department of Physics Chalmers University of Technology SE-412 96 Gothenburg Telephone +46 31 772 1000 Typeset in LATEX, template by Kyriaki Antoniadou-Plytaria Gothenburg, Sweden 2024 De Novo Molecular Generation of Molecules with Consistent Synthetic Strategy Albin Ekborg Department of Physics Chalmers University of Technology Abstract The discovery of new pharmaceutical compounds is essential for developing novelmedicines, a process that is both costly and time-consuming. In this context, generative Artificial Intelli- gence (AI) methods offer a valuable, cost-effective approach for generating large amounts of de novo pharmaceutical candidates. However, these drug candidates must still be synthesized for use in in vitro and in vivo studies, necessitating the creation of efficiently synthesizable drug candidates. In this thesis, we explore two AI models for drug design developed by As- traZeneca: REINVENT and AiZynthFinder. REINVENT is a generative AI model designed for creating small molecules fine-tuned with Reinforcement Learning (RL), whereas AiZyn- thFinder is a framework for retrosynthesis. We explore the combination of these models through two novel reward functions and demonstrate that REINVENT, through querying of AiZynthFinder, can be conditioned to generate molecules that share an identical synthetic route. This is done in twoways, by provision of a template reaction, or by letting REINVENT independently find identical routes, allowing the generated molecules to be synthesized in parallel. KEYWORDS: REINVENT, AIZYNTHFINDER, DEEP LEARNING, DRUG DESIGN. * Acknowledgements First of all, I would like to thank my supervisors, Alexey Voronov and Jon Paul Janet, as well as my examiner Mats Granath, for providing me with guidance and support throughout my thesis. Furthermore, I would like to thank Elias Stenhede Johansson andMauritz Kööhler for their detailed proofreading and comprehensive feedback, bringing much-needed clarity to otherwise convoluted sentences. Finally, I would like to thankWojciech & Joel - my master’s studies just wouldn’t have been the same without you. Albin Ekborg, Gothenburg, June 2024 * Table of Contents 1 Introduction 1 1.1 REINVENT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 AiZynthFinder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.3 Aim . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 2 Theory 3 2.1 Chemistry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2.1.1 Representation of Molecular Structure . . . . . . . . . . . . . . . . . . . . . . . . . 3 2.1.2 Chemical Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2.1.3 Retrosynthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2.1.4 Classification of Molecular Reactions . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2.1.5 Tree Edit Distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2.2 REINVENT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2.2.1 Recurrent Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.2.2 Gated Recurrent Unit (GRU) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.2.3 RNN training on Next-Token Prediction . . . . . . . . . . . . . . . . . . . . . . . . 6 2.2.4 Sequence Generation with RNNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.2.5 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.2.6 Policy-Gradient Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.2.7 REINFORCE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.2.8 Fine Tuning Molecular Generators . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.2.9 Scoring System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.2.10 Multiple Parameter Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.3 AiZynthFinder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.3.1 Computer Aided Synthesis Planning . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.3.2 Monte Carlo Tree Search (MCTS) . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.3.3 Neural Network guided MCTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 3 Methods 11 3.1 Molecular Generation conditioned on Synthetic Similarity . . . . . . . . . . . . . . . . . . 11 3.1.1 Connecting AiZynthFinder and REINVENT . . . . . . . . . . . . . . . . . . . . . 11 3.1.2 Route Similarity in REINVENT . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 3.1.3 Scoring based on Reference Route . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 3.1.4 Emergent Route Distance Score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 3.2 Parameter Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 3.3 Multiple Parameter Optimization - Realistic Application . . . . . . . . . . . . . . . . . . . 13 4 Results 14 4.1 Reference Route Scoring Component . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 4.1.1 Parameter Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 4.1.2 MPO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 4.2 Emergent Route Scoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 5 Discussion 17 5.1 Reference Route Scoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 5.1.1 MPO - Real Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 5.2 Emergent Route Scoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 5.2.1 MPO for the emergent scoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 5.3 Further research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 6 Conclusion 19 Appendix A REINVENT configuration Options 1 A.1 Randomizing SMILES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 A.2 Diversity Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 A.2.1 Penalize Same Smiles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 A.2.2 Identical Murcko Scaffold . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 A.3 Scoring Transforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 A.3.1 Parameterized Sigmoid Scoring Transform . . . . . . . . . . . . . . . . . . . . . . . 2 A.3.2 Truncated Exponential Scoring Transform . . . . . . . . . . . . . . . . . . . . . . . 2 A.3.3 Lenient Scoring Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 A.3.4 Truncated Inverse Scoring Transform . . . . . . . . . . . . . . . . . . . . . . . . . . 2 Appendix B Scoring Components in MPO 2 B.1 Molecular Weight . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 B.2 ROCS Similarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 B.3 Custom Alerts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 B.4 Quantitative Estimate of Drug-Likeness (QED) . . . . . . . . . . . . . . . . . . . . . . . . 3 B.5 Hydrogen Bond Donors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 De Novo Molecular Generation of Molecules with Consistent Synthetic Strategy Albin Ekborg Chalmers University of Technology June 2024 Abstract The discovery of new pharmaceutical compounds is essential for developing novel medicines, a process that is both costly and time-consuming. In this context, generative Artificial Intelligence (AI) methods offer a valuable, cost-effective approach for generating large amounts of de novo pharmaceutical candi- dates. However, these drug candidates must still be synthesized for use in in vitro and in vivo studies, necessitating the creation of efficiently synthesizable drug candidates. In this thesis, we explore two AI models for drug design developed by AstraZeneca: REINVENT and AiZynthFinder. REINVENT is a generative AI model designed for creating small molecules fine-tuned with Reinforcement Learning (RL), whereas AiZynthFinder is a framework for retrosynthesis. We explore the combination of these models through two novel reward functions and demonstrate that REINVENT, through querying of AiZynthFinder, can be conditioned to generate molecules that share an identical synthetic route. This is done in two ways, by provision of a template reaction, or by letting REINVENT independently find identical routes, allowing the generated molecules to be synthesized in parallel. 1 Introduction The process of taking a drug from discovery to mar- ket can take up to 12-15 years, and cost in excess of $1 billion [1]. During this process, a drug candidate undergoes heavy scrutiny in regard to its safety and efficacy. If a drug fails, it is due to it either not be- ing safe, or not being efficient. As such, being able to discover pharmaceutical candidates that already satisfy the requirements of safety and efficacy poses tremendous value for the pharmaceutical industry. The process of finding novel pharmaceutical candi- dates has traditionally relied on rational drug de- sign, a systematic approach of designing a molecule to fit a predefined set of properties as well as re- quirements on safety and efficacy [2, 3]. This re- quires both conscious effort and detailed knowledge of a therapeutic target, which is time-consuming and costly. In this context, the prospect of leverag- ing generative Artificial Intelligence (AI) methods becomes valuable as a cost-effective approach for generating a large number of de novo pharmaceu- tical candidates [4]. While being able to generate a large number of po- tential drug candidates is useful in its own right, drug candidates still need to be synthesized for use in in vitro and in vivo studies. A common in vitro method for synthesizing a large number of poten- tial drug candidates is Parallel Library Synthesis (PLS) [5]. In PLS, the drug candidates are system- atically synthesized in parallel under the same con- ditions, enabling the efficient creation of chemical libraries containing different potential drug candi- dates. This process is highly beneficial for drug dis- covery as it allows for the rapid creation of chem- ical compounds in parallel. However, it requires that the molecules share a synthetic pathway, and therefore, we need design methods that can produce sets of candidates with a consistent strategy. Thus, there is a compelling case for a generative AI model that is capable of performing molecular design under the constraint of synthetic similarity, such that the downstream synthesis can be done in parallel. In this thesis, we present one implemen- tation of such a method, relying on two AI frame- works developed by AstraZeneca, REINVENT and 2 De Novo Molecular Generation of Molecules with Consistent Synthetic Strategy AiZynthFinder [6, 7]. A brief introduction to the systems will follow. 1.1 REINVENT REINVENT is the name of a neural-network-based generative AI framework that performs the design of small molecules [6, 8]. A REINVENT model is a Recurrent Neural Network (RNN) trained to generate molecular sequences known as Simplified Molecular Input Line Entry System (SMILES) [9]. A REINVENT model (henceforth referred to as an agent) can be additionally fine-tuned to generate SMILES pertaining to a certain chemical nature through the use of Reinforcement Learning (RL), which has been shown to be a flexible and reliable method of guiding the agent to specific regions of chemical space [6, 8], while being able to satisfy multiple constraints at the same time [10]. This fine-tuning mechanism is noteworthy, as it can ef- fectively filter the generation of compounds from the set of all possible molecules to a subset with de- sirable properties, as distinguished by certain scor- ing components. A scoring component in REINVENT is any func- tion that acts on a generated molecule and returns a value between 0 and 1, denoting its desirabil- ity. It is a very important concept as it is the primary mechanism through which molecular gen- eration can be guided. The mechanism is explained in detail in Section 2. 1.2 AiZynthFinder AiZynthFinder is the name of an algorithmic syn- thesis planning software that breaks down a tar- get molecule into a tree of its constituent reagents and reactions using different search algorithms, pri- marily by neural network-guided Monte Carlo Tree Search (MCTS) [7, 11]. It attempts to find plau- sible synthetic routes by recursively decomposing the target molecule into simpler precursors through a series of potential chemical reactions, ultimately seeking to reach commercially available building blocks. The software utilizes a database of known reac- tions, such as the United States Patent and Trade- mark Office (USPTO) repository, to guide its de- composition strategy. The use of neural networks allows it to predict the feasibility of chemical trans- formations, while the MCTS algorithm ensures ef- ficient exploration of the search space. By priori- tizing routes based on predicted feasibility, AiZyn- thFinder is able to propose synthesis pathways that are not only plausible but also optimal in terms of cost, time, and resource utilization. 1.3 Aim Combining REINVENT and AiZynthFinder offers an interesting pipeline for designing novel drug candidates, where REINVENT generates molecu- lar structures that adhere to specific design goals, and AiZynthFinder is able to plan their synthetic route. In this thesis, we will explore a method for generat- ing compounds constrained by synthetic similarity. The objective is to determine whether the genera- tion of molecules can be guided by their synthetic routes through the combined use of AiZynthFinder and REINVENT, integrated as a scoring compo- nent where the desirability of a generated molecule is determined based on its synthetic route, as sug- gested by AiZynthFinder. Specifically, the thesis aims to address the following questions: • Using REINVENT in conjunction with AiZyn- thFinder, can we generate molecules that have a synthetic route identical to an external ref- erence? • Is it possible to discover molecules that share a synthetic route emergently, without providing a reference? Theory 3 2 Theory The following section is divided into three parts, Chemistry, REINVENT, and AiZynthFinder. The implementations of these concepts are described in section 3. It is assumed that the reader is familiar with basic topics in machine learning. 2.1 Chemistry This section aims to provide a brief overview of the chemical theory necessary to understand how the systems are implemented. More specifically, this section touches upon molecular representa- tions, synthesis, and classification. 2.1.1 Representation of Molecular Structure The structure of a molecule can be represented se- quentially through the use of Simplified Molecular Input Line Entry System (SMILES), an ASCII- compatible molecular representation [9] where atoms and bonds are represented using common symbols. It is possible to describe a given molecule in multiple valid ways, which can cause ambi- guity [12]. To remedy this, SMILES are often canonicalized. The canonical SMILES ensure that the SMILES is a unique identifier for a particular molecule. Figure 2.1 shows an example of a struc- tural representation of the molecule caffeine, as well as its corresponding Canonical SMILES. O N N N O N SMILES: CN1C=NC2=C1C(=O)N(C(=O)N2C)C Figure 2.1: Molecular structure of the molecule caffeine, alongside its SMILES representation. SMILES provide a natural way of working with molecules in a sequential manner, lending itself to the application of machine learning models. 2.1.2 Chemical Synthesis Chemical synthesis is the process of combining molecules under certain conditions (chemical reac- tions) to form other molecules (products). It can be viewed as a tree, wherein multiple reactions are connected in a sequential or branched man- ner, where each node represents a molecule or a chemical reaction, and a chemical reaction connects two molecular nodes. See Figure 2.2 for a chemi- cal synthesis tree. In this work, synthetic tree and synthetic route will be used interchangeably. For Figure 2.2: A 2-step chemical synthesis tree, showing the synthesis of the fungal compound muscimol. Here, chemical reactions are signified by •, whereas the nodes containing molecular structures are different compounds partaking in the reactions. The figure was rendered using the Python package RDKit [13]. the rest of this thesis, a more compact illustration of synthetic trees will be favored, where molecular structures are omitted. 2.1.3 Retrosynthesis Retrosynthesis can intuitively be thought of (but not explicitly considered) as the inverse opera- tion of chemical synthesis. Instead of designing a pathway from reactants to a final product, ret- rosynthesis works backward from the desired target molecule, identifying simpler precursor compounds that can lead to its synthesis. Essentially, it poses an answer to the question ”How can this molecule be synthesized?”. 2.1.4 Classification of Molecular Reactions In this thesis, molecular reactions are classified according to the reaction classification system NameRxn [14], which provides a hierarchical ap- proach to naming chemical reactions. The hierar- chy consists of three levels, a superclass, class, and 4 De Novo Molecular Generation of Molecules with Consistent Synthetic Strategy subclass. The hierarchical nature takes chemical similarity into account, wherein two distinct reac- tions belonging to the same superclass are inher- ently more chemically similar than two reactions belonging to two different superclasses. In Exam- ple 1 below, the class and name of three chemical reactions are shown. Example 1: 2.1.1 Amide Schotten-Baumann Reaction 2.1.8 Thioamide Schotten-Baumann Reaction 3.11.130 Piancatelli rearrangement It should be apparent that the classification is hi- erarchical: 2.x.x covers one type of chemical re- action, with the 2.1.x category specifically cov- ering ’Schotten-Baumann’ reactions. In contrast, 3.11.130 represents a completely different class of chemical reactions. 2.1.5 Tree Edit Distance Tree Edit Distance (TED) is a similarity metric for graphs, which quantifies the similarity between a pair of trees by measuring the operational cost of transforming one tree into the other [15]. The available operations are insertion, deletion and re- naming of labels. These operations, performed on pairwise comparisons of node or edge labels in a tree, are all that is necessary to reach identity be- tween two trees. If two trees are topologically identical but have different node features, the renaming of said fea- tures is all that is required to reach identity. Sim- ilarly, if two trees have an identical root structure but a different branch, the insertion or deletion of said branch is all that is required to reach identity. What TED concerns is the finding of the minimal cost of the sequence of operations performed be- tween two trees. By assigning a different cost to each of these operations, one can tailor the metric for specific purposes. This thesis implements TED such that chemical relevance between chemical reactions in synthesis trees is accounted for. In order to measure the similarity between chemical synthesis routes in a manner that preserves chemical relevance, TED is weighed such that similar chemical reactions cost less to rename. In practicality, this means lever- aging the hierarchical nature of the reaction class naming system described in section 2.1.4, such that more similar chemical reactions cost less to re- name. It is worth noting that the notion of chem- ical similarity of synthesis routes only considers reaction-level features of a synthesis tree, and not the molecules partaking in said reactions. This is because the primary focus is on the reactions in- volved, rather than the participating molecules. To further illustrate how TED calculates distance on synthesis routes, an example is provided in Ex- ample 2. Example 2: Tree Edit Distance on synthesis routes Consider two synthetic routes, A and B, illustrated in Figure 2.3 below. Note that the node features have been excluded, as they are not taken into ac- count when TED is calculated. A B Figure 2.3: Two reaction trees A and B. Here, ■ represents a reaction, color denoting a difference in reaction class, and ◦ represents a molecular node. We can see that the operational cost of transform- ing A to B corresponds to one insertion, and one renaming operation. 2.2 REINVENT This section describes the original implementation of REINVENT. It cites closely the original work [8], and aims to provide the reader with a brief but complete understanding of how REINVENT works. In this thesis, the focus lies on the introduction 2.2 REINVENT 5 of two scoring components into REINVENT: the Reference scoring component, which attempts to generate molecules that have an identical synthetic route to a provided reference, and the Emergent scoring component, which instead produces syn- thetically consistent molecules emergently. For this reason, it is important to have a good understand- ing of how the scoring system works. We begin by considering the architectural backbone of REIN- VENT, and then proceed to explain the fine-tuning mechanism. 2.2.1 Recurrent Neural Networks A Recurrent Neural Network (RNN) is a type of neural network architecture that is designed to cap- ture dependencies between steps in sequential data [16]. They are characterized by their feedback con- nections, a connection between a hidden neuron and itself, see Figure 2.4a. The feedback con- nections allow the previous neuron state to influ- ence the interpretation of the current state. RNNs are trained with Backpropagation Through Time (BTT), where the feedback is unrolled over sev- eral time steps. This approach essentially allows for gradient flow to be traced back through each time step, enabling the network to learn the associ- ation between a given state and its relative position in a sequence [17]. See Figure 2.4b for a represen- tation of the feedback unrolling. ht xt Ot whx woh whh (a) h1 x1 O1 whx woh h2 x2 O2 whx woh whh . . . ht xt Ot whx woh (b) Figure 2.4: Schematics of a Recurrent Neural Network (left) and it’s corresponding unrolled state (right). Here, w relates the connection weights between neurons, t being the time step and O, x and h being the output, input and hidden states respectively. The presence of feedback connections makes RNNs particularly suitable for sequence prediction, and by extension sequence generation, since context about earlier time steps is propagated through the network. However, the feedback mechanism also makes learning long-term dependencies diffi- cult due to the increased susceptibility to vanishing or exploding gradients as the sequence length in- creases [16, 17]. This is often remedied by the intro- duction of gated neuron mechanisms, which are ca- pable of regulating the flow of information through the network by retaining or discarding information dynamically, through the inclusion of learnable pa- rameters that can modify the state of a neuron [18, 19]. This enables the network to learn what in- formation is relevant to keep or discard over long sequences. One such mechanism, the Gated Recur- rent Unit (GRU), will be considered in the following section. 2.2.2 Gated Recurrent Unit (GRU) The GRU is a type of neuron/cell in an RNN de- signed to mitigate the risk of vanishing and explod- ing gradients for long sequences. This is achieved by replacing the standard neuron with a unit that has two gates, the reset gate and update gate. These gates modulate the flow of information in the feedback connections. They are very similar in their definition, but their influence on the update rule is distinct. The reset gate determines how much of the previous state should be forgotten. It is defined as rt = σ(Wr · [ht−1, xt] + br), (1) with rt being the reset gate output at time t, σ denoting the sigmoid function, Wr being the learn- able weight matrix for the reset gate, ht−1 being the previous hidden state, xt being the input at time t and br being the bias. The update gate determines how much of the previ- ous state should be retained. It is defined similarly to eq. (1), zt = σ(Wz · [ht−1, xt] + bz), (2) with zt being the output of the update gate at time t and the suffix z denoting a separate set of parame- ters for the update function for the same quantities as in eq. (1). 6 De Novo Molecular Generation of Molecules with Consistent Synthetic Strategy The effect of the gates is realized in two stages. First, the effect of the reset gate is incorporated by calculating the new memory content of the current state: h̃t = tanh(Wc · [rt ⊙ ht−1, xt] + b), (3) where h̃t denotes the candidate hidden state, unaf- fected by the effect of the update gate, and Wc de- notes yet another weight matrix, exclusive to this operation. Additionally, ⊙ refers to element-wise multiplication. All of this is then combined into the final hidden state, which is calculated according to ht = zt ⊙ ht−1 + (1 − zt) ⊙ h̃t. (4) This ensures that, if the update gate zt is close to unity, the previous hidden state ht−1 is largely carried over to the final state ht. Conversely, if zt is close to 0, the final hidden state is almost entirely replaced by h̃t The combined effect of the gating mechanism al- lows the GRU to retain relevant information in the network states over longer sequences, reducing the degradation that happens as the sequences get longer. 2.2.3 RNN training on Next-Token Prediction The RNN is trained on the task of next-token pre- diction by maximum likelihood estimation of the next token in the target sequence, conditioned on the previous tokens in the sequence [8]. Here, a token is a one-hot encoded representation of a SMILES character. The output of the network at each step is the probability distribution over the set of available tokens, and the goal is to maximize the likelihood of the correct token xt. This is equiv- alent to minimizing the negative log-likelihood of the distributions for each step t, and the training loss becomes J(θ) = − T∑ t=1 log P (xt|xt−1, ..., x1) (5) with J(θ) being the loss function, calculated batch- wise with respect to the network parameters θ. The network is then trained with backpropagation through time, minimizing the loss using Stochastic Gradient Descent (SGD) with the update rule θt+1 = θt − α∇J(θt), (6) with α being the learning rate. 2.2.4 Sequence Generation with RNNs Generation of sequences is performed in an auto- regressive manner, wherein the next token in the sequence is randomly sampled from the output dis- tribution at each time step. Each generated token is then iteratively fed back as input to the network at the next time step. This relies on two tokens that encode the start of a sequence, GO, and the end of a sequence EOS. The generation ends once the EOS token is drawn. Figure 2.5 provides an overview of token generation for an RNN with one cell. h1 GO x1 h2 x2 . . . ht EOS Figure 2.5: A schematic of an RNN performing autoregressive generation of tokens. The output of a time step t is fed as input to the next step, until the termination token EOS has been generated. 2.2.5 Reinforcement Learning Reinforcement Learning (RL) is a machine learn- ing optimization technique where an agent learns from the consequences of its actions a, and is re- warded for desirable behaviors [20]. Unlike other learning algorithms, RL does not rely on prede- fined labels but instead uses a policy π which aims to determine the appropriate action in response to a specific state s of the environment. This policy can be either deterministic, wherein one action is determined for each state, or probabilistic. For a probabilistic policy, the policy describes the prob- ability distribution over actions given a particular 2.2 REINVENT 7 state. Additionally, a reward function R(s, a) as- sesses the desirability of an action in a certain state. The goal of the agent is then to learn the policy that maximizes the expected reward E(Gt) where Gt is the sum of discounted rewards at timestep t, Gt = ∞∑ k=0 γkRt+k+1. (7) Here, γ ∈ [0, 1] is the discount factor, through which future rewards are devalued. This means that a reward received at k time steps in the fu- ture is worth γk−1 times as much as if it was given immediately [20]. Note that there are two indices that correspond to time, with t indicating the cur- rent timestep at which the decision is made, and k serving as the step offset from t, counting the num- ber of timesteps into the future for each reward. We can then form the policy-dependent action- value function, which represents the expected re- turn of taking action a in state s, and thereafter following the policy π. The action-value function is written as qπ(s, a) = Eπ[Gt|s, a] = [ ∞∑ k=0 γkRt+k+1|s, a ] π . (8) The action-value function allows us to evaluate the effectiveness of each action in a specific state under the current policy, forming a relationship between policy and expected reward. We will leverage this in the following section. 2.2.6 Policy-Gradient Methods Through parameterization of the policy, policy- gradient methods aim to directly optimize the pol- icy π(a|s, θ) through adjustments of the policy pa- rameters θ [20]. Any parameterization is possible, as long as the policy is differentiable with respect to its parameters. Presented in eq. (9) is the policy-gradient Theorem for an episodic task with a probabilistic parame- terized policy π(a|s, θ), with parameters θ. The theorem states that the gradient of the loss with respect to the policy parameters is proportional to the expected return of operating under the policy [20]. We can describe this more formally as follows. Let µ(s) denote the on-policy distribution over the visited states, qπ(a, s) be the expected return of fol- lowing the policy at state s, and ∇π(a|s, θ) be the gradient of the policy with respect to the parame- ters θ. Then, the policy gradient theorem can be written as ∇J(θ) ∝ ∑ s µ(s) ∑ a qπ(a, s)∇π(a|s, θ), (9) where ∇J(θ) is the gradient of the loss with regards to the policy parameters θ [20]. One specific type of policy-gradient method is the REINFORCE algorithm [20], which subjects an agent to learning the optimal policy by explicitly computing the gradients of the policy with respect to its performance, evaluated episodically. In this case, an episode is a task that has a clear beginning and end and is recorded as a sequence of actions. In the following section, the REINFORCE algorithm will be covered briefly. 2.2.7 REINFORCE The REINFORCE algorithm employs stochastic gradient descent to update policy parameters. In this section, we will explore the derivation of the update rule that enables this process. We begin by expressing eq. (9) in terms of the ex- pected value over both actions and states under the policy, ∇J(θ) ∝ Eπ [ qπ(At, St) ∇π(At|St, θ) π(At|St, θ) ] , (10) which can be expressed even more compactly by noting that qπ(At, St) = Eπ[Gt|St, At], i.e. the return at time t, and by utilizing the identity ∇log x = ∇x x : ∇J(θ) ∝ Eπ[Gt∇log π(At|St, θ)]. (11) Equation (11) reveals a quantity that can be sam- pled at each time step, and which is proportional to the gradient of the loss. This allows us to form the REINFORCE update for the policy parameters, denoted as a step in stochastic gradient descent: θt+1 ≈ θt − αGt∇log π(At|St, θt), (12) where α is the learning rate. 8 De Novo Molecular Generation of Molecules with Consistent Synthetic Strategy This is important, because it allows us to frame the fine tuning process based entirely on the return of each sequence. By letting the return of a sequence denote its desirability, we can condition the genera- tion to favor compounds which are highly desirable. The process of formulating such a return is covered in the following section. 2.2.8 Fine Tuning Molecular Generators Recall that the output of the network at each time step is the probability distribution over the possible tokens given its current state. By letting an action denote the addition of a token, and the state being the current sequence of tokens, the network can be viewed as a policy. This is the policy which we seek to optimize. Consider an RNN trained on the task of next-token prediction of SMILES strings, henceforth referred to as the prior network. Now consider a separate network that is an identical copy of the prior net- work, which we will refer to as the agent network. Our goal is to have the agent learn to produce molecules that score highly under a scoring func- tion S ∈ [0, 1], that is, we want to increase the likelihood of generating high-scoring sequences un- der the agent policy. We begin by subjecting the agent to the decision of what token to add given its current state. Let at denote the action of adding a token to a se- quence being generated, and st be the state at time t. The sequence of performed actions is denoted A. Then, P (A) = T∏ t=1 π(at|st) (13) represents the likelihood of the formed sequence A under the policy π [8]. At this point, we want to up- date the policy of the agent such that the expected score of the generated sequences S(A) is increased, whilst also anchoring the agent to the prior in such a way that the agent policy doesn’t stray too far away from the prior. This is because the prior re- tains good knowledge of SMILES syntax. The an- choring is achieved in two steps. First, we denote an augmented likelihood, logP (A)U = logP (A)prior + σS(A), (14) which specifies the likelihood of the sequence under the prior, modulated by the score of the sequence, where σ is a scalar weight. Then, we denote the re- turn of a sequence G(A) as the negative squared dif- ference of the augmented and agent log-likelihoods: G(A) = −[log(P (A)Agent − logP (A)U]2, (15) which aims to capture the agreement between the likelihoods for a given sequence. It is worth noting that this formulation is an extension to the REIN- FORCE algorithm made by the original authors of REINVENT [8]. By substituting the long-term reward in eq. (12) with G evaluated on an entire sequence, we are able to frame the fine-tuning problem in terms of a REINFORCE algorithm update. Thus, by sub- jecting the agent to minimizing the loss function J(θ) = −G, we gain two things. First, we gain the update rule with which we are able to traverse parameter space (eq. (12)), and secondly, we make sure that the training process favors highly scoring sequences whilst also minimizing the discrepancy between the agent and the prior policy. 2.2.9 Scoring System The scoring in REINVENT aims to determine the value of each generated compound, and is vital for guiding the agent in the desired direction. As this score is what is used to determine the augmented likelihood of a sequence (eq. (14)), and by extension the return (eq. (15)) which we use to calculate our parameter updates, it naturally becomes an impor- tant part in guiding the fine-tuning. A scoring function S(A) is defined to assess the de- sirability of a generated sequence A. It returns a numerical value in the range [0, 1], where a higher score denotes greater desirability. To better explain the process of scoring, an example of scoring molec- ular sequences based on their molecular weight is provided. Example 3: Scoring based on molecular weight threshold Consider a generated molecule described by the se- quence A with molecular weight WA, and a scoring function S(A) = { 1 if Wmin ≤ WA ≤ Wmax 0 otherwise (16) 2.3 AiZynthFinder 9 with lower and upper thresholds Wmin, Wmax re- spectively. It is evident from the definition of S(A) in Equation (16) that a molecule with molecular weight outside the interval will get a score of 0. Together with Equation (14), this results in no up- dates to the agent policy in the direction of such a molecule. Example 3 provides a trivial case, but the intuition holds true for more complex scoring functions as well. For example, it is common to use a continuous scoring function that has a positive slope in the direction of the highest-scoring region. This will ensure that as the output moves in the direction of the desirable region, better and better scores are returned, continuously rewarding the agent. Such scoring components are covered in appendix A. 2.2.10 Multiple Parameter Optimization It is possible to evaluate multiple scoring compo- nents per generated sequence. The score of sepa- rate components can then be aggregated into one by arithmetic or geometric mean. As an example of such an aggregate, consider two arbitrary but distinct scoring functions. An agent constrained by these functions would then be driven to gener- ate sequences that score highly under both scoring functions. For the rest of this thesis, the aggrega- tion of multiple scoring functions will be referred to as Multiple Parameter Optimization (MPO). The relationship between scoring components dic- tates the behavior of the RL. In a simple case, two scoring components can be synergistic, where com- pounds that score highly under one component also score high under the other, narrowing the search. It can also be destructive, wherein different com- ponents ”pull” in different directions of chemical space. As the number of components grow, the in- teractions between components grow increasingly difficult to analyze. 2.3 AiZynthFinder Like section 2.2, this section aims to provide the reader with a brief understanding of the retrosyn- thetic framework AiZynthFinder. It is worth not- ing that, for the purpose of this project, this system is used as-is without any modifications. As such, the explanation of the system will be more qualita- tive in nature, not going into great detail in favor of brevity. 2.3.1 Computer Aided Synthesis Planning The goal of retrosynthetic analysis is to suggest possible synthetic routes for a given chemical com- pound [21]. In Computer Aided Synthesis Planning (CASP), this is performed algorithmically in silico [22, 23]. Since the set of possible routes for a given molecule is large [24–26], a retrosynthetic algorithm will search for possible solutions. This search is condi- tioned on some external factors, such as availabil- ity of reagents, chemical complexity, or cost [11, 27]. In AiZynthFinder, this search is a Monte Carlo Tree Search (MCTS) guided by different neural net- work policies [7, 11]. 2.3.2 Monte Carlo Tree Search (MCTS) MCTS is a policy-based search algorithm designed to iteratively build a decision tree of optimal ac- tions [28, 29]. In order to determine the optimal actions, it estimates the value of different paths in the tree through Monte-Carlo simulations of poten- tial outcomes. In its basic form, it consists of 4 steps, 1. Selection, which starts at the root node and recursively applies a selection policy to descend the tree. 2. Expansion, which adds children to a selected node according to the available actions. 3. Simulation, wherein an outcome for the new children is simulated based on a simulation pol- icy. The reward is evaluated once a simulation reaches a terminal state. 4. Backpropagation, where the simulation result is propagated upwards from the traversal, up- dating the statistics for each visited node. An illustration of the steps in MCTS is shown in Section 2.3.2 below. The Selection and Expansion are governed by the Selection Policy, a function that determines which 10 De Novo Molecular Generation of Molecules with Consistent Synthetic Strategy 1. Selection 2. Expansion 3. Simulation 4. Backpropagation Figure 2.6: The 4 phases of the Monte Carlo Tree Search Algorithm. nodes to grow based on some measure, often the ex- pected reward [28, 29]. The Simulation is governed by the Simulation Policy, which aims to estimate a valuation for a non-terminal state. This is the Monte-Carlo part of the algorithm, where an eval- uation of the action is performed by stochastically traversing the possible states starting from the cur- rently selected state. This process continues until a specified recursion depth or a terminal node is reached [28, 29]. The basic MCTS algorithm is very flexible, and what AiZynthFinder concerns is an adaptation of the original MCTS algorithm, which will be cov- ered in the following section. 2.3.3 Neural Network guided MCTS AiZynthFinder builds upon the basic MCTS algo- rithm by incorporating 3 pre-trained Neural Net- work policies [11], referred to as 3N-MCTS. In AiZynthFinder, an action represents a chemical re- action, and a state represents a molecule. The 3 Neural Policies are as follows: 1. Expansion Policy Network, which guides the tree search to only consider the most promising actions. 2. In-Scope Filter Network, a binary classifier trained to predict the chemical feasibility of a suggested reaction. 3. Rollout Policy Network, which determines how actions are sampled in the simulation step. Es- sentially, the Monte-Carlo simulation is done under the distribution of the Rollout Policy. This formulation allows for considerable chemical knowledge to be injected into the decision process, as well as optimizations with regards to cost or availability of precursor, wherein multiple networks can be trained at specific tasks that narrow the combinatorial complexity of the search while simul- taneously increasing the search quality. Methods 11 3 Methods The following section describes the implementation of the two scoring components, which is the main contribution of this thesis. Additionally, it presents the experimental setup through which the perfor- mance of the scoring components was evaluated. It builds upon the knowledge presented in section 2, and serves as a basis for the results presented in section 4. 3.1 Molecular Generation conditioned on Syn- thetic Similarity In order to condition a REINVENT agent to out- put molecules that have a similar synthetic route, molecules that share synthetic routes need to be scored highly under the agent. This requires two things - an online retrieval of synthesis routes for each generated molecule, and a robust notion of route similarity. In Figure 3.1 on the next page, a high-level overview of the system can be seen. The implementation of the separate parts will be covered in the following sections. 3.1.1 Connecting AiZynthFinder and REINVENT Both AiZynthFinder and REINVENT are imple- mented in Python, exist as CLI tools, and can interact with their environment through different commands. An AiZynthFinder run starts by be- ing supplied with a set of SMILES, on which it performs 3N-MCTS and returns a set of suggested synthesis routes. A REINVENT run, on the other hand, generates a batch of SMILES according to the agent policy. Thus, in order to achieve online retrieval of synthetic routes for each molecule, we simply start a run of AiZynthFinder that performs retrosynthesis on each generated molecule. This is illustrated further in the following example. Example 4: An example scoring run. Consider a run of REINVENT generating a batch of SMILES. As the batch is completed, the agent seeks to determine the score for each compound and update its policy accordingly. At scoring time, REINVENT starts AiZynthFinder as a subprocess, feeding to it each generated SMILES, and getting in return a set of possible synthesis routes for each SMILES. At this point, each route for each gener- ated molecule is scored according to its similarity to a provided reference route. The reference route is fixed for the entire run. The best scoring tree (being the one with the lowest distance to refer- ence) for each molecule is saved, and the score for said tree is reported as the score of the molecule. The augmented and the agent likelihoods are cal- culated, formulating the batch-wise loss, and the parameters of the agent network are updated ac- cording to eq. (12). This procedure iterates until it is terminated. 3.1.2 Route Similarity in REINVENT Measuring similarity between pairs of synthesis trees is done using a Python implementation of TED [15, 30] with a custom configuration for the operational costs. The operational costs are depen- dent upon the NameRXN classification system as described in section 2.1.5, and are presented in Ta- ble 1. Table 1: Table of Operational Costs for the TED operations. Super class references the first position in the NameRXN classification system, i.e., 2.x.x, class the second position, and subclass the final position. Operation Cost Insertion 1 Deletion 1 Rename superclass 0.8 Rename class 0.5 Rename subclass 0 We see from table 1 that there is a higher cost as- sociated with larger hierarchical differences in reac- tion space. Since these differences are more chem- ically dissimilar, the associated cost aims to drive the generated molecules to a more constrained area in chemical space. Additionally, topological differ- ences between synthesis trees, i.e. the addition or removal of reactions, are penalized the most. 3.1.3 Scoring based on Reference Route By providing REINVENT with a reference route and calculating the TED between it and the route for each generated compound, the similarity for 12 De Novo Molecular Generation of Molecules with Consistent Synthetic Strategy AiZynthFinderSCORING SMILES Synthesis RoutesScores Query Distance to Reference SMILES REINVENT Figure 3.1: A High-Level Overview of the connection between REINVENT and AiZynthFinder through a scoring component. Here, the scoring component passes the SMILES to AiZynthFinder which returns retrosynthetic suggestions for each molecule. The Scoring component then calculates the distance to reference via TED, and returns the corresponding score to REINVENT. In this view, the scoring component is modeled as a separate entity, even though it is technically a part of REINVENT. each compound can be quantified. In this case, the best scoring route for each molecule is the route that has the lowest TED to a provided reference route. Thus, the provided reference route serves as a template for which reactions are considered good, guiding the generation of molecules in the direction of the reference. This allows for the exploration of compounds when a reaction route is known. In this thesis, two references were considered, of which an abstraction is shown in fig. 3.2. The two routes aimed to occupy ”complex” and ”sim- ple” chemistry, exploring the efficacy of the method when subjected to two different levels of synthetic complexity. They were selected manually, and cat- egorized as complex or simple based on the num- ber of reactions and prevalence of said reactions. In essence, the complex route had more synthesis steps as well as more uncommon reactions than the simple route. 3.1.4 Emergent Route Distance Score In the emergent scoring case, the focus shifts from minimizing the TED to a reference to instead max- imizing the size of a singular group of reaction routes. Groups are based on the NameRXN clas- sification system and are created for each batch of molecules. The process is outlined as follows. For each generated compound, REINVENT re- trieves a corresponding set of reaction routes. Each route is then assigned a shorthand identifier de- noted as the chain of reactions represented by the NameRXN system. The frequency of each route in A B Figure 3.2: Simple synthetic route (A) and complex synthetic route (B). The simple route represents a 2 step synthesis with common reactions, and the complex route is a 4 step synthesis with different and more uncommon reactions. the batch is then recorded in a dictionary, forming groups where the size of each group indicates its prevalence. Then, for each compound, the route that falls into the largest possible group is selected. The scoring component also implements a filter that cumulatively sums the size of each group for each epoch. If the number of molecules exceed the filter threshold, the score is set to 0. The score for each compound is then calculated as S(A) =  ( G(A)−min(G) max(G) )2 if G(A) < F 0 otherwise , (17) where G is the dictionary of groups, A is a SMILES sequence, F is the filter, and G(A) denotes the size of the largest group that is occupied by a route that can synthesize A. Thus, generated molecules that can be synthesized 3.2 Parameter Evaluation 13 by the largest group is favoured. Conversely, small groups will be scored low. Note that the score is evaluated batch-wise, which means that the group sizes, and thus scores, are dynamic. This further encourages the collapse into one large group, as the average score per batch will increase when the num- ber of groups is low. 3.2 Parameter Evaluation REINVENT has many different hyperparameters that can change the outcome of the learning. Thus, in order to properly evaluate the effect of the in- troduced scoring component, the effect of different configurations needs to be explored. Additionally, this provides insight into how an optimal configu- ration might look. Due to computational constraints, the influence of parameters was evaluated using a one-parameter- at-a-time approach, wherein parameters were var- ied from a baseline configuration. In order to gain confidence of a noted influence, three replicates were made. The base configuration consisted of parameters at levels that had shown success when applied to adjacent tasks. In table 2 below, the base configuration can be seen. Table 2: Table showing the configuration for the baseline route distance scoring run. Here, σ refers to the scaling parameter in eq. (14). Parameter Value Batch Size 64 Learning Rate 0.001 σ 128 Randomize Smiles* True Diversity Filter* Identical Murcko Scaffold Scoring Transform* Reverse Sigmoid *Covered in Appendix Some parameters, such as batch size and learning rate, are self-explanatory and will not be explained further. The other parameters, however, will be ex- plained in appendix A. The parameters were var- ied based on this configuration and run for three replicates on the simple reference route. The best- performing configuration is then transplanted to an MPO run for both the simple and complex route, as described below. 3.3 Multiple Parameter Optimization - Realistic Application By subjecting REINVENT to multiple scoring com- ponents simultaneously, REINVENT will attempt to find the subspace of the output that contains high-scoring molecules for all components. This allows for the exploration of the impact that the scoring component has in a realistic application, wherein a user aims to satisfy multiple properties at the same time. A generated molecule is consid- ered a ”hit” if it scores above the supplied threshold values for all of the component scores. In this thesis, MPO is used for the purpose of ex- ploring the performance of the route distance com- ponent when it is joined by multiple scoring com- ponents, aimed at emulating a supposed realistic application of REINVENT. It does this by attempt- ing to replicate a previous study [10], with the added constraint of route similiarity. The scoring components and their settings are described in ap- pendix B, alongside a table showcasing the thresh- old values. 14 De Novo Molecular Generation of Molecules with Consistent Synthetic Strategy 4 Results In this section, we present the results for the two scoring components. The results for the reference scoring component are shown as a parameter evalu- ation and an MPO run comparing the performance to a baseline. For the emergent scoring component, only the performance for one MPO run is shown. Here, performance is measured as the cumulative number of hits for an MPO run, where hits refer to generated molecules that score sufficiently high under all active components. The components for this MPO experiment were chosen to replicate a previous study [10], with the added constraint of route similarity. 4.1 Reference Route Scoring Component The results for the reference route scoring compo- nent is presented in two parts. First, as an assese- ment of the parameter evaluation aimed at discov- ering a good configuration of the component, and then as the performance of an MPO run compared to a baseline. 4.1.1 Parameter Search The experiment was set up according to the de- scription in section 3.2, where each parameter was varied from a base configuration and ran for three replicates. The average performance of each evalu- ation can be seen in Figure 4.1 on the next page. Looking at Figure 4.1, it is clear that the expo- nential scoring transform vastly outperforms the other configurations, showing that the majority of the generated compounds have 0 distance to the provided reference at the end of the run. On the contrary, the inverse and lenient scoring transforms produce the lowest number of compounds with 0 distance to the reference for the investigated time- frame. We can also see that increasing the learning rate two-fold resulted in instability, as noted by the decrease in valid routes after around 200 epochs, whereas the halved learning rate slowly increased the mid-low distance compounds but not converg- ing. 4.1.2 MPO The multiple parameter optimization experiment was carried out for the simple and complex route, and compared to a baseline MPO without route distance scoring. As described in section 3.3, a molecule is considered a hit if it exceeds the given score threshold for each component. For the base- line, it means that a hit share route with the refer- ence by pure chance. In Figure 4.2, the number of cumulative hits between a baseline and a run scored with the route distance component is compared. 0 200 400 600 800 1000 Epoch 0 10 20 30 40 50 60 H its Cumulative Sum over Number of Hits - Average of 5 Route Distance Baseline Figure 4.2: Multi Parameter Optimization for the simple route, compared between a baseline and a route distance scoring run. The standard deviation is shown as the translucent spread, calculated over 5 replicates. From Figure 4.2 we see that, while both methods are capable of generating hits with a consistent syn- thetic strategy, the route distance scoring compo- nent outperforms the baseline by having a more productive trajectory, generating more hits per run on average. Using the same experimental setup for the com- plex route, we found that no hits were produced for either the baseline or the route distance scor- ing component during any of the three attempted replicates. This indicates a failure to find a chemi- cal subspace that satisfies all the scores. 4.2 Emergent Route Scoring Emergent scoring is able to discover groups with molecules whose chemical synthesis routes are iden- tical, without requiring an explicit template. In Figure 4.3, an MPO scoring run with the emergent component is shown. 4.2 Emergent Route Scoring 15 TED = 0 0.5 <= TED < 0.8 0.8 <= TED < 1.3 1.3 <= TED 0 200 400 600 800 1000 Epoch 0 20 40 60 R ou te s Baseline 0 200 400 600 800 1000 Epoch 0 20 40 60 R ou te s Penalize Same Smiles 0 200 400 600 800 1000 Epoch 0 20 40 R ou te s Learning Rate 0.0005 0 200 400 600 800 1000 Epoch 0 20 40 60 R ou te s Learning Rate 0.002 0 200 400 600 800 1000 Epoch 0 20 40 60 R ou te s Learning Sigma 256 0 200 400 600 800 1000 Epoch 0 20 40 60 R ou te s Learning Sigma 512 0 200 400 600 800 1000 Epoch 0 25 50 75 100 R ou te s Batch Size 128 0 200 400 600 800 1000 Epoch 0 20 40 60 R ou te s Randomize SMILES Off 0 200 400 600 800 1000 Epoch 0 20 40 60 R ou te s Scoring transform (Exp) 0 200 400 600 800 1000 Epoch 0 20 40 60 R ou te s Scoring transform (Inv) 0 200 400 600 800 1000 Epoch 0 20 40 60 R ou te s Scoring transform (Lenient) Parameter Evaluation of Reference Route Distance Component, Average of 3 Replicates Figure 4.1: Performance evaluation shown as the distance to reference per route, plotted per epoch. Each subplot is titled by the parameter being evaluated. A high fraction of low-distance compounds means better performance. The upper figure shows the cumulative sum over output molecules grouped by reaction route for each epoch, showing how REINVENT is capable of finding a lucrative route and exploiting it until the filter mechanism is activated and the score is reduced to 0. We note that the cumulative sum keeps increasing for some routes even after the fil- ter is activated, a result of the filter mechanism act- ing on the returned scores rather than blocking the generation entirely. From the point of filter activa- tion and onward, there is no reward for generating molecules that belong to that group, however, it is still possible to do so. We see that this behaviour is present throughout the run, where the agent enters 16 De Novo Molecular Generation of Molecules with Consistent Synthetic Strategy 0 200 400 600 800 1000 Epoch 0 1000 2000 3000 C um ul at iv e Su m Cumulative Sum of Routes over Epoch Routes 2.1, 2.1 1.6, 2.1 1.2 2.1, 1.3 2.1, 9.3 1.3 2.1 1.6 2.1, 1.6 2.1, 1.2 0 200 400 600 800 1000 Epoch 0 50 100 150 200 C um ul at iv e N um be r of H its Cumulative Sum of Hits by Route over Epoch Routes 2.1, 1.2 2.1, 1.6 2.1, 1.3 Figure 4.3: Upper: Cumulative sum over routes for each epoch. Lower: The cumulative hits produced during the MPO run categorized by route. In the upper plot, the red line corresponds to the filter activation threshold. Additionally, routes with a cumulative sum below 1500 and 25 have been omitted for the upper and lower plots respectively. a productive phase for one particular route until the filter is activated, rapidly proceeding to exploit another route. Looking at the lower part of Figure 4.3, we see the cumulative number of hits. Here, a hit is based on the same criterion as in section 4.1.2, i.e. a molecules has to score sufficiently high score for all present components. We can note that there is a clear overlap between the exploited route in the upper figure and the productive route in the lower figure. However, we also note that the productive phases doesn’t begin until around epoch 700. Up until that point, the agent has not yet learned to reliably produce molecules that perform well under the other scoring components in the MPO. Once the agent has learned sufficiently well, we see that the production of hits coincide with the exploita- tion of a certain route, with short lag times between filter activation and exploitation. Discussion 17 5 Discussion This thesis covered the implementation of two dif- ferent scoring components for generating molecules conditioned on synthetic similarity. In this section, we discuss the results presented in section 4. 5.1 Reference Route Scoring The reference route scoring component is, through the explicit measure of the TED between all gen- erated compounds and the provided reference, able to guide the generation towards a region in chemi- cal space that contains molecules with a consistent synthetic strategy. However, this relies upon the user providing a feasible reference route. Thus, the use of this scoring component is aimed at the cases of drug discovery where a synthetic route is already available. Additionally, the reference routes used in this the- sis were curated manually, with the reasoning pre- sented in section 3.1.3. By performing manual se- lection of routes based on a manufactured notion of complexity considerable bias might have been introduced to the system. It is possible that the selected complex route isn’t synthetically complex, but just synthetically improbable. Here, the role of AiZynthFinder can not be understated, as it aims to minimize the cost of the routes it suggests. This means that shorter routes are inherently fa- vored, leading to a smaller chance of finding iden- tical routes when the reference is long - a valuable insight for the further development of the compo- nent, as it suggests a limitation of the combined system. Overall, a better survey on reaction chem- istry could have allowed for more robust assump- tions, leading to a better evaluation of the compo- nents performance. Furthermore, the parameter evaluation in sec- tion 4.1.1 revealed that the scoring component was very sensitive to which scoring transform was used. This is particularly evident when looking at the exceptional performance of the exponential scor- ing transform. It is interesting to note that the exponential scoring function was something that had not been previously implemented in REIN- VENT. Conversely, the inverse and the lenient scoring transforms performed surprisingly poorly. It is thought that this is due to these scoring func- tions exhibiting a worse payoff - the difference in return for a sequence that is marginally closer to the reference is not high enough to guide the net- work. The higher the payoff, the stronger the pull. 5.1.1 MPO - Real Application The effect of aggregating multiple scoring compo- nents is illustrated in the MPO runs, where multi- ple scoring components interact to make the search space difficult to traverse. However, both the base- line and the scoring component for the simple route manage to generate components that have zero dis- tance to the simple route. The most interesting point of discussion here is the fact that the baseline generated a surprising amount of compounds with zero distance to reference, even though it was never subjected to information about the synthesis routes of the molecules it generated. This could possibly be attributed to the flexible chemistry of the simple reference route, which consists of common reactions that potentially cover the synthesis of many differ- ent compounds, emerging spontaneously as a good candidate route from AiZynthFinder. This is fur- ther strengthened by the high spread seen in the baseline - some runs will, by chance, end up pro- ducing molecules that can be synthesized with the supplied reference route. However, since there is no constraint, other runs will produce molecules that is synthesized by different synthetic routes. 5.2 Emergent Route Scoring The emergent route scoring allows for the discov- ery of synthetically parallelizable compounds with- out the need for any prior knowledge about a com- pounds synthesis. By foregoing the need to evalu- ate similarity between routes, instead taking only equality into account, the method becomes faster and sharper. In the upper part of Figure 4.3, we see how the number of molecules evolve with epoch, showcasing the flexibility and speed of the method - as each filter is applied, the agent rapidly finds a new route to exploit. The emergent scoring, as it has been implemented for this thesis, also removes the need to use a scor- ing transform. This reduces the complexity of the parameter search and removes one moving part from this otherwise very complex system. 18 De Novo Molecular Generation of Molecules with Consistent Synthetic Strategy 5.2.1 MPO for the emergent scoring Naturally, the emergent scoring manages to gener- ate more hits than the reference scoring component - removing the restriction of a reference route allows the agent to explore more freely. Now, many solu- tions exist, whereas previously only one would be accepted. What is interesting is that the emergent scoring component seems to retain the ability to generate high-scoring compounds under the rest of the components, even when the filter activates and drops a route score to 0, which can be seen in the later stages of the lower part of Figure 4.3. As the filter is activated, a new productive route is found almost immediately. This is likely a consequence of evaluating the scoring component dynamically and batch-wise, where the formation of a large majority group is strongly encouraged - resulting in a rapid collapse into one productive group. 5.3 Further research A topic for further research would be to make a qualitative analysis of how reference route com- plexity relates to performance, where the notion of complexity is robustly determined. As has been shown in this work, the chosen complex route did not yield any hits for the MPO runs, which could be due to either a destructive relationship between the other components and the complex reference, or due to the fact that AiZynthFinder guides its search towards shorter routes. The correlation be- tween reference route complexity and generative performance should thus be investigated further, in order to learn more about the limits of this com- ponent. Another topic is optimizing the operational weights for TED calculations. The weights used to deter- mine the operational costs for TED, as presented in Table 1, is an area that was not explored in great detail during this thesis, even though it undoubt- edly has a large impact on how the scores are calcu- lated. For example, the lowest possible distance to a compound is 0.5, corresponding to a difference in the ”class” level of a named reaction. This will then be transformed to different values depending on the scoring transform. Thus, since the scoring trans- forms showed great impact on the performance of the component, it is safe to assume that the oper- ational weights of TED would have at least some impact as well. It is thus of great interest to know if additional performance can be gained by adjusting these weights. Finally, while the emergent route distance compo- nent shows great initial promise, it has not been assessed robustly. Minor alterations to configura- tions have been attempted based on intuition, as opposed to the systematic evaluation seen in Fig- ure 4.1 for the reference route component. Perhaps there are further performance gains to be discov- ered. Conclusion 19 6 Conclusion In this thesis we have demonstrated a novel method for generating molecules with a consistent synthetic route. By connecting REINVENT with AiZyn- thFinder, two distinct scoring components were implemented - one which is capable of generating molecules whose synthetic route adhere to an ex- ternal reference, and one that emergently discovers molecules with identical synthetic pathway. This provides the answers to the questions that this the- sis sought to adress. Furthermore, we are able to show that the scoring components can be used in junction with other components, allowing for the exploration of molecules that fit a multitude of de- sign criteria, whilst now also being synthesizable in parallel. It is our hope that this will facilitate the development of new pharmaceutical treatments by enhancing the process of drug discovery. 20 De Novo Molecular Generation of Molecules with Consistent Synthetic Strategy References [1] J. Hughes, S. Rees, S. Kalindjian, and K. Philpott, “Principles of early drug discovery,” British Journal of Pharmacology, vol. 162, no. 6, pp. 1239–1249, 2011, issn: 1476-5381. doi: 10. 1111 / j . 1476 - 5381 . 2010 . 01127 . x. [Online]. Available: https://onlinelibrary.wiley.com/ doi/abs/10.1111/j.1476-5381.2010.01127.x (visited on 04/23/2024). [2] M. K. Mahapatra and M. Karuppasamy, “Funda- mental considerations in drug design,” Computer Aided Drug Design (CADD): From Ligand-Based Methods to Structure-Based Approaches, pp. 17– 55, 2022. doi: 10.1016/B978- 0- 323- 90608- 1.00005- 8. [Online]. Available: https://www. ncbi.nlm.nih.gov/pmc/articles/PMC9212230/ (visited on 05/09/2024). [3] S. Mandal, M. Moudgil, and S. K. Mandal, “Ra- tional drug design,” eng, European Journal of Pharmacology, vol. 625, no. 1-3, pp. 90–100, Dec. 2009, issn: 1879-0712. doi: 10.1016/j.ejphar. 2009.06.065. [4] H. Chen, O. Engkvist, Y. Wang, M. Olivecrona, and T. Blaschke, “The rise of deep learning in drug discovery,” Drug Discovery Today, vol. 23, no. 6, pp. 1241–1250, Jun. 2018, issn: 1359-6446. doi: 10.1016/j.drudis.2018.01.039. [Online]. Available: https://www.sciencedirect.com/ science/article/pii/S1359644617303598 (vis- ited on 04/15/2024). [5] A. W. Dombrowski, A. L. Aguirre, A. Shrestha, K. A. Sarris, and Y. Wang, “The Chosen Few: Parallel Library Reaction Methodologies for Drug Discovery,” eng, The Journal of Organic Chemistry, vol. 87, no. 4, pp. 1880–1897, Feb. 2022, issn: 1520-6904. doi: 10.1021/acs.joc. 1c01427. [6] H. H. Loeffler et al., “Reinvent 4: Modern AI– driven generative molecule design,” Journal of Cheminformatics, vol. 16, no. 1, p. 20, Feb. 2024, issn: 1758-2946. doi: 10 . 1186 / s13321 - 024 - 00812 - 5. [Online]. Available: https : / / doi . org / 10 . 1186 / s13321 - 024 - 00812 - 5 (visited on 03/18/2024). [7] S. Genheden, A. Thakkar, V. Chadimová, J.-L. Reymond, O. Engkvist, and E. Bjerrum, “AiZyn- thFinder: A fast, robust and flexible open-source software for retrosynthetic planning,” Journal of Cheminformatics, vol. 12, no. 1, p. 70, Nov. 2020, issn: 1758-2946. doi: 10 . 1186 / s13321 - 020 - 00472 - 1. [Online]. Available: https : / / doi . org / 10 . 1186 / s13321 - 020 - 00472 - 1 (visited on 03/18/2024). [8] M. Olivecrona, T. Blaschke, O. Engkvist, and H. Chen, “Molecular de-novo design through deep reinforcement learning,” Journal of Cheminfor- matics, vol. 9, no. 1, p. 48, Sep. 2017, issn: 1758- 2946. doi: 10.1186/s13321-017-0235-x. [On- line]. Available: https : / / doi . org / 10 . 1186 / s13321-017-0235-x (visited on 04/18/2024). [9] D. Weininger, “SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules,” Journal of Chemical Information and Computer Sciences, vol. 28, no. 1, pp. 31–36, Feb. 1988, Publisher: American Chemical Society, issn: 0095-2338. doi: 10.1021/ci00057a005. [Online]. Available: https://doi.org/10.1021/ci00057a005 (vis- ited on 04/15/2024). [10] M. Dodds, J. Guo, T. Löhr, A. Tibo, O. Engkvist, and J. Paul Janet, “Sample efficient reinforce- ment learning with active learning for molecular design,” en, Chemical Science, vol. 15, no. 11, pp. 4146–4160, 2024. doi: 10.1039/D3SC04653B. [Online]. Available: https://pubs.rsc.org/en/ content/articlelanding/2024/sc/d3sc04653b (visited on 05/09/2024). [11] M. H. S. Segler, M. Preuss, and M. P. Waller, “Planning chemical syntheses with deep neural networks and symbolic AI,” en, Nature, vol. 555, no. 7698, pp. 604–610, Mar. 2018, Publisher: Na- ture Publishing Group, issn: 1476-4687. doi: 10. 1038/nature25978. [Online]. Available: https: //www.nature.com/articles/nature25978 (vis- ited on 05/07/2024). [12] D. Weininger, A. Weininger, and J. L. Weininger, “SMILES. 2. Algorithm for generation of unique SMILES notation,” Journal of Chemical Infor- mation and Computer Sciences, vol. 29, no. 2, pp. 97–101, May 1989, Publisher: American Chemical Society, issn: 0095-2338. doi: 10 . 1021/ci00062a008. [Online]. Available: https: //doi.org/10.1021/ci00062a008 (visited on 04/18/2024). [13] RDKit. [Online]. Available: https://www.rdkit. org/ (visited on 06/10/2024). [14] I. Lagersted, J. Mayfiel, and R. Sayl, NameRx More than just a Reaction Classifier, en, 2021. [15] M. Pawlik and N. Augsten, “Tree edit distance: Robust and memory-efficient,” Information Sys- tems, vol. 56, pp. 157–173, Mar. 2016, issn: 0306- 4379. doi: 10.1016/j.is.2015.08.004. [Online]. Available: https://www.sciencedirect.com/ science/article/pii/S0306437915001611 (vis- ited on 03/18/2024). [16] B. Mehlig, Machine Learning with Neural Net- works: An Introduction for Scientists and Engi- https://doi.org/10.1111/j.1476-5381.2010.01127.x https://doi.org/10.1111/j.1476-5381.2010.01127.x https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1476-5381.2010.01127.x https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1476-5381.2010.01127.x https://doi.org/10.1016/B978-0-323-90608-1.00005-8 https://doi.org/10.1016/B978-0-323-90608-1.00005-8 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9212230/ https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9212230/ https://doi.org/10.1016/j.ejphar.2009.06.065 https://doi.org/10.1016/j.ejphar.2009.06.065 https://doi.org/10.1016/j.drudis.2018.01.039 https://www.sciencedirect.com/science/article/pii/S1359644617303598 https://www.sciencedirect.com/science/article/pii/S1359644617303598 https://doi.org/10.1021/acs.joc.1c01427 https://doi.org/10.1021/acs.joc.1c01427 https://doi.org/10.1186/s13321-024-00812-5 https://doi.org/10.1186/s13321-024-00812-5 https://doi.org/10.1186/s13321-024-00812-5 https://doi.org/10.1186/s13321-024-00812-5 https://doi.org/10.1186/s13321-020-00472-1 https://doi.org/10.1186/s13321-020-00472-1 https://doi.org/10.1186/s13321-020-00472-1 https://doi.org/10.1186/s13321-020-00472-1 https://doi.org/10.1186/s13321-017-0235-x https://doi.org/10.1186/s13321-017-0235-x https://doi.org/10.1186/s13321-017-0235-x https://doi.org/10.1021/ci00057a005 https://doi.org/10.1021/ci00057a005 https://doi.org/10.1039/D3SC04653B https://pubs.rsc.org/en/content/articlelanding/2024/sc/d3sc04653b https://pubs.rsc.org/en/content/articlelanding/2024/sc/d3sc04653b https://doi.org/10.1038/nature25978 https://doi.org/10.1038/nature25978 https://www.nature.com/articles/nature25978 https://www.nature.com/articles/nature25978 https://doi.org/10.1021/ci00062a008 https://doi.org/10.1021/ci00062a008 https://doi.org/10.1021/ci00062a008 https://doi.org/10.1021/ci00062a008 https://www.rdkit.org/ https://www.rdkit.org/ https://doi.org/10.1016/j.is.2015.08.004 https://www.sciencedirect.com/science/article/pii/S0306437915001611 https://www.sciencedirect.com/science/article/pii/S0306437915001611 REFERENCES 21 neers. Cambridge: Cambridge University Press, 2021, isbn: 978-1-108-49493-9. doi: 10 . 1017 / 9781108860604. [Online]. Available: https : / / www . cambridge . org / core / books / machine - learning - with - neural - networks / 0028D1883AF842C81340508119AB6491 (visited on 05/07/2024). [17] A. Sherstinsky, “Fundamentals of Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) Network,” Physica D: Nonlin- ear Phenomena, vol. 404, p. 132 306, Mar. 2020, arXiv:1808.03314 [cs, stat], issn: 01672789. doi: 10.1016/j.physd.2019.132306. [Online]. Avail- able: http://arxiv.org/abs/1808.03314 (vis- ited on 04/23/2024). [18] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling, arXiv:1412.3555 [cs], Dec. 2014. [Online]. Available: http : / / arxiv . org / abs / 1412 . 3555 (visited on 04/19/2024). [19] S. Hochreiter and J. Schmidhuber, “Long Short- term Memory,” Neural computation, vol. 9, pp. 1735–80, Dec. 1997. doi: 10 . 1162 / neco . 1997.9.8.1735. [20] R. Sutton and A. Barto, “Reinforcement Learn- ing: An Introduction,” IEEE Transactions on Neural Networks, vol. 9, no. 5, pp. 1054–1054, Sep. 1998, Conference Name: IEEE Transactions on Neural Networks, issn: 1941-0093. doi: 10. 1109 / TNN . 1998 . 712192. [Online]. Available: https : / / ieeexplore . ieee . org / document / 712192 (visited on 04/18/2024). [21] Z. Wang, W. Zhao, G. Hao, and B. Song, “Map- ping the resources and approaches facilitating computer-aided synthesis planning,” en, Organic Chemistry Frontiers, vol. 8, no. 4, pp. 812–824, Feb. 2021, Publisher: The Royal Society of Chem- istry, issn: 2052-4129. doi: 10.1039/D0QO00946F. [Online]. Available: https://pubs.rsc.org/en/ content/articlelanding/2021/qo/d0qo00946f (visited on 05/07/2024). [22] E. J. Corey and W. T. Wipke, “Computer- Assisted Design of Complex Organic Syntheses,” Science, vol. 166, no. 3902, pp. 178–192, Oct. 1969, Publisher: American Association for the Advancement of Science. doi: 10.1126/science. 166 . 3902 . 178. [Online]. Available: https : / / www.science.org/doi/10.1126/science.166. 3902.178 (visited on 05/07/2024). [23] W.-D. Ihlenfeldt and J. Gasteiger, “Computer- Assisted Planning of Organic Syntheses: The Second Generation of Programs,” en, Ange- wandte Chemie International Edition in English, vol. 34, no. 23-24, pp. 2613–2633, 1996, issn: 1521-3773. doi: 10.1002/anie.199526131. [On- line]. Available: https://onlinelibrary.wiley. com/doi/abs/10.1002/anie.199526131 (visited on 05/07/2024). [24] S. Hong, H. H. Zhuo, K. Jin, G. Shao, and Z. Zhou, “Retrosynthetic planning with experience- guided Monte Carlo tree search,” en, Communi- cations Chemistry, vol. 6, no. 1, pp. 1–14, Jun. 2023, Publisher: Nature Publishing Group, issn: 2399-3669. doi: 10.1038/s42004- 023- 00911- 8. [Online]. Available: https : / / www . nature . com / articles / s42004 - 023 - 00911 - 8 (visited on 05/07/2024). [25] C. W. Coley, L. Rogers, W. H. Green, and K. F. Jensen, “Computer-Assisted Retrosynthe- sis Based on Molecular Similarity,” ACS Cen- tral Science, vol. 3, no. 12, pp. 1237–1245, Dec. 2017, Publisher: American Chemical Society, issn: 2374-7943. doi: 10 . 1021 / acscentsci . 7b00355. [Online]. Available: https : / / doi . org/10.1021/acscentsci.7b00355 (visited on 05/07/2024). [26] H. Dai, C. Li, C. W. Coley, B. Dai, and L. Song, Retrosynthesis Prediction with Conditional Graph Logic Network, arXiv:2001.01408 [cs, stat], Jan. 2020. doi: 10.48550/arXiv.2001.01408. [Online]. Available: http : / / arxiv . org / abs / 2001.01408 (visited on 05/07/2024). [27] P. Ertl and A. Schuffenhauer, “Estimation of syn- thetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions,” Journal of Cheminformatics, vol. 1, no. 1, p. 8, Jun. 2009, issn: 1758-2946. doi: 10.1186/1758-2946-1-8. [Online]. Avail- able: https://doi.org/10.1186/1758-2946-1- 8 (visited on 05/07/2024). [28] C. Browne et al., “A Survey of Monte Carlo Tree Search Methods,” IEEE Transactions on Compu- tational Intelligence and AI in Games, vol. 4:1, pp. 1–43, Mar. 2012. doi: 10 . 1109 / TCIAIG . 2012.2186810. [29] M. Świechowski, K. Godlewski, B. Sawicki, and J. Mańdziuk, “Monte Carlo Tree Search: A re- view of recent modifications and applications,” en, Artificial Intelligence Review, vol. 56, no. 3, pp. 2497–2562, Mar. 2023, issn: 1573-7462. doi: 10.1007/s10462-022-10228-y. [Online]. Avail- able: https://doi.org/10.1007/s10462-022- 10228-y (visited on 05/07/2024). [30] M. Pawlik and N. Augsten, “Efficient Computa- tion of the Tree Edit Distance,” ACM Transac- tions on Database Systems, vol. 40, no. 1, 3:1– 3:40, Mar. 2015, issn: 0362-5915. doi: 10.1145/ https://doi.org/10.1017/9781108860604 https://doi.org/10.1017/9781108860604 https://www.cambridge.org/core/books/machine-learning-with-neural-networks/0028D1883AF842C81340508119AB6491 https://www.cambridge.org/core/books/machine-learning-with-neural-networks/0028D1883AF842C81340508119AB6491 https://www.cambridge.org/core/books/machine-learning-with-neural-networks/0028D1883AF842C81340508119AB6491 https://www.cambridge.org/core/books/machine-learning-with-neural-networks/0028D1883AF842C81340508119AB6491 https://doi.org/10.1016/j.physd.2019.132306 http://arxiv.org/abs/1808.03314 http://arxiv.org/abs/1412.3555 http://arxiv.org/abs/1412.3555 https://doi.org/10.1162/neco.1997.9.8.1735 https://doi.org/10.1162/neco.1997.9.8.1735 https://doi.org/10.1109/TNN.1998.712192 https://doi.org/10.1109/TNN.1998.712192 https://ieeexplore.ieee.org/document/712192 https://ieeexplore.ieee.org/document/712192 https://doi.org/10.1039/D0QO00946F https://pubs.rsc.org/en/content/articlelanding/2021/qo/d0qo00946f https://pubs.rsc.org/en/content/articlelanding/2021/qo/d0qo00946f https://doi.org/10.1126/science.166.3902.178 https://doi.org/10.1126/science.166.3902.178 https://www.science.org/doi/10.1126/science.166.3902.178 https://www.science.org/doi/10.1126/science.166.3902.178 https://www.science.org/doi/10.1126/science.166.3902.178 https://doi.org/10.1002/anie.199526131 https://onlinelibrary.wiley.com/doi/abs/10.1002/anie.199526131 https://onlinelibrary.wiley.com/doi/abs/10.1002/anie.199526131 https://doi.org/10.1038/s42004-023-00911-8 https://doi.org/10.1038/s42004-023-00911-8 https://www.nature.com/articles/s42004-023-00911-8 https://www.nature.com/articles/s42004-023-00911-8 https://doi.org/10.1021/acscentsci.7b00355 https://doi.org/10.1021/acscentsci.7b00355 https://doi.org/10.1021/acscentsci.7b00355 https://doi.org/10.1021/acscentsci.7b00355 https://doi.org/10.48550/arXiv.2001.01408 http://arxiv.org/abs/2001.01408 http://arxiv.org/abs/2001.01408 https://doi.org/10.1186/1758-2946-1-8 https://doi.org/10.1186/1758-2946-1-8 https://doi.org/10.1186/1758-2946-1-8 https://doi.org/10.1109/TCIAIG.2012.2186810 https://doi.org/10.1109/TCIAIG.2012.2186810 https://doi.org/10.1007/s10462-022-10228-y https://doi.org/10.1007/s10462-022-10228-y https://doi.org/10.1007/s10462-022-10228-y https://doi.org/10.1145/2699485 22 De Novo Molecular Generation of Molecules with Consistent Synthetic Strategy 2699485. [Online]. Available: https://doi.org/ 10.1145/2699485 (visited on 03/18/2024). https://doi.org/10.1145/2699485 https://doi.org/10.1145/2699485 https://doi.org/10.1145/2699485 https://doi.org/10.1145/2699485 APPENDIX 1 Appendix A REINVENT configuration Options This appendix aims to cover the hyper-parameters and scoring components, explaining their function and use-case. A.1 Randomizing SMILES This is a configuration that allows REINVENT to apply SMILES randomization to its generated com- pounds, wherein a SMILES is de-canonicalized and instead represented as a random valid SMILES string. This can, in some cases, facilitate the learning process, as the same molecule can be rewarded, corresponding to different sequences. A.2 Diversity Filters The diversity filters serve to introduce diversity by lowering the scores for repeatedly-seen molecules, decided on by some metric. In other words, some measure is employed to measure similarity between molecules, and too similar molecules are penalized. It has a few tunable parameters: Table 3: Hyperparameters for Diversity Filter in REINVENT Hyperparameter Description Diversity Metric The metric used to calculate the similarity between molecules. Tanimoto similarity on molecular finger- prints is common. Score Threshold The score after which generated molecules will be put into memory, known as the ”bucket”. ”Bucket” Size The size of the memory (number of molecules) for molecules that exceed the threshold. Penalty Multiplier A scalar weight w ∈ [0, 1] that is multiplied with the score, determining the harshness of the penalty. A.2.1 Penalize Same Smiles This diversity filter penalizes generation of identical canonical SMILES, which means that multiples of the same molecules will be scored low. This enforces uniqueness. A.2.2 Identical Murcko Scaffold This filter penalizes the generation of molecules that have an identical Murcko scaffold. A Murcko scaffold is the core structure of a molecule, such as chains and ring structures, but not counting the substructures and substituents. This filter is thus much more strict in promoting diverse chemistry. A.3 Scoring Transforms This section aims to elaborate on the concept of scoring transforms. Given a score S(A) for a sequence A, consider the transform T : R → [0, 1]. The function T aims to transform the score, which is not necessarily bounded, to the bounded domain [0, 1]. Since various scores are measured on different scales, it is necessary to normalize them to such a range. In REINVENT, this is commonly done with a sigmoid or step function. Some examples will follow. 2 De Novo Molecular Generation of Molecules with Consistent Synthetic Strategy A.3.1 Parameterized Sigmoid Scoring Transform The sigmoid scoring transform is a common method for normalizing scores. By parameterizing, it can be made more expressive. The parameterized sigmoid function can be defined as: T (x) = 1 1 + ek(x−M) , M = L + H 2 . where x is the input score, k is a steepness factor, L is the lower threshold and H is the upper threshold. This function maps any real number x to the interval [0, 1]. The sigmoid function is smooth and has an S-shaped curve, with a midpoint at M . It can be either a left-sigmoid or a right-sigmoid based on the sign of the exponent. The combination of both creates a double-sigmoid, where a high scoring ”window” is formed between them. A.3.2 Truncated Exponential Scoring Transform Another scoring transform is a truncated exponential function, on the form T (x) = min(1, exp(−x)). Which provides different characteristics to the sigmoid function. When considering the Route Distance Score, which has its best possible value at 0, this scoring has a good fit. Most notably, the maximum return coincides with the best value of the route distance score, and the slope is concave up, generating increasing rewards as the score approaches 0. A.3.3 Lenient Scoring Transform The lenient scoring transform is a special configuration of the parameterized sigmoid scoring function, where the steepness factor is set to a very low value. This creates a very slight and gradual slope. A.3.4 Truncated Inverse Scoring Transform The truncated inverse scoring transform is written as T (x) = min(1, 1 1 + x ), x ≥ 0. Similarly to the exponential scoring function, it has a maximum at x = 0, and a Appendix B Scoring Components in MPO The MPO runs performed in this thesis consisted of multiple scoring components, which are explained here. B.1 Molecular Weight The molecular weight scoring component scores the generated molecules based on their molecular weight. In the presented MPO runs, this was done with a double sigmoid with lower threshold 200 g/mol and upper threshold 550 g/mol. APPENDIX 3 B.2 ROCS Similarity This scoring component evaluates a molecule based on it’s Rapid Overlay of Chemical Structure (ROCS) Similiarity to a provided Template. It is a measure of structural similarity between a generated molecule and a provided template structure, returning a higher score for higher similarity. B.3 Custom Alerts The custom alers flags specific substructures in a generated molecule by SMILES Arbitrary Target Specification (SMARTS). For the MPO run it was selected to remove unstable groups and un-physical ring sizes. B.4 Quantitative Estimate of Drug-Likeness (QED) The QED component returns a numerical estimate of drug-likeness for a molecule in the range [0, 1]. A higher value means that it has a higher estimate of being a drug-like compound. B.5 Hydrogen Bond Donors The Hydrogen Bond Donors component measures a molecules fitness to ”Lipinskis Rule of Five”, which states that a drug compound in general has no more than 5 hydrogen bond donors in its structure. This is done by measuring the number of Hydrogen Bond Donors in the molecule and taking a right-sigmoid with upper threshold 6 and lower threshold 2. Introduction REINVENT AiZynthFinder Aim Theory Chemistry Representation of Molecular Structure Chemical Synthesis Retrosynthesis Classification of Molecular Reactions Tree Edit Distance REINVENT Recurrent Neural Networks Gated Recurrent Unit (GRU) RNN training on Next-Token Prediction Sequence Generation with RNNs Reinforcement Learning Policy-Gradient Methods REINFORCE Fine Tuning Molecular Generators Scoring System Multiple Parameter Optimization AiZynthFinder Computer Aided Synthesis Planning Monte Carlo Tree Search (MCTS) Neural Network guided MCTS Methods Molecular Generation conditioned on Synthetic Similarity Connecting AiZynthFinder and REINVENT Route Similarity in REINVENT Scoring based on Reference Route Emergent Route Distance Score Parameter Evaluation Multiple Parameter Optimization - Realistic Application Results Reference Route Scoring Component Parameter Search MPO Emergent Route Scoring Discussion Reference Route Scoring MPO - Real Application Emergent Route Scoring MPO for the emergent scoring Further research Conclusion Appendix REINVENT configuration Options Randomizing SMILES Diversity Filters Penalize Same Smiles Identical Murcko Scaffold Scoring Transforms Parameterized Sigmoid Scoring Transform Truncated Exponential Scoring Transform Lenient Scoring Transform Truncated Inverse Scoring Transform Appendix Scoring Components in MPO Molecular Weight ROCS Similarity Custom Alerts Quantitative Estimate of Drug-Likeness (QED) Hydrogen Bond Donors