MASTER’S THESIS 2024

De NovoMolecular Generation of
Molecules with Consistent Synthetic Strategy

ALBIN EKBORG

DEPARTMENT OF PHYSICS
CHALMERS UNIVERSITY OF TECHNOLOGY

Gothenburg, Sweden 2024


De Novo Molecular Generation of
Molecules with Consistent Synthetic Strategy

ALBIN EKBORG

© ALBIN EKBORG, 2024.

Supervisor: Jon Paul Janet, AstraZeneca
Supervisor: Alexey Voronov, AstraZeneca

Examiner: Mats Granath, Department of Physics

Master’s Thesis 2024
Department of Physics
Chalmers University of Technology
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Typeset in LATEX, template by Kyriaki Antoniadou-Plytaria
Gothenburg, Sweden 2024


De Novo Molecular Generation of Molecules with Consistent Synthetic Strategy
Albin Ekborg
Department of Physics
Chalmers University of Technology

Abstract

The discovery of new pharmaceutical compounds is essential for developing novelmedicines,
a process that is both costly and time-consuming. In this context, generative Artificial Intelli-
gence (AI) methods offer a valuable, cost-effective approach for generating large amounts of
de novo pharmaceutical candidates. However, these drug candidates must still be synthesized
for use in in vitro and in vivo studies, necessitating the creation of efficiently synthesizable
drug candidates. In this thesis, we explore two AI models for drug design developed by As-
traZeneca: REINVENT and AiZynthFinder. REINVENT is a generative AI model designed
for creating small molecules fine-tuned with Reinforcement Learning (RL), whereas AiZyn-
thFinder is a framework for retrosynthesis. We explore the combination of these models
through two novel reward functions and demonstrate that REINVENT, through querying
of AiZynthFinder, can be conditioned to generate molecules that share an identical synthetic
route. This is done in twoways, by provision of a template reaction, or by letting REINVENT
independently find identical routes, allowing the generated molecules to be synthesized in
parallel.

KEYWORDS: REINVENT, AIZYNTHFINDER, DEEP LEARNING, DRUG DESIGN.


*


Acknowledgements

First of all, I would like to thank my supervisors, Alexey Voronov and Jon Paul Janet, as well
as my examiner Mats Granath, for providing me with guidance and support throughout my
thesis. Furthermore, I would like to thank Elias Stenhede Johansson andMauritz Kööhler for
their detailed proofreading and comprehensive feedback, bringing much-needed clarity to
otherwise convoluted sentences. Finally, I would like to thankWojciech & Joel - my master’s
studies just wouldn’t have been the same without you.

Albin Ekborg, Gothenburg, June 2024


*


Table of Contents

1 Introduction 1
1.1 REINVENT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 AiZynthFinder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Aim . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

2 Theory 3
2.1 Chemistry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2.1.1 Representation of Molecular Structure . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1.2 Chemical Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1.3 Retrosynthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1.4 Classification of Molecular Reactions . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1.5 Tree Edit Distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2.2 REINVENT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2.1 Recurrent Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2.2 Gated Recurrent Unit (GRU) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2.3 RNN training on Next-Token Prediction . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.4 Sequence Generation with RNNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.5 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.6 Policy-Gradient Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.7 REINFORCE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.8 Fine Tuning Molecular Generators . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.9 Scoring System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.10 Multiple Parameter Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.3 AiZynthFinder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3.1 Computer Aided Synthesis Planning . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3.2 Monte Carlo Tree Search (MCTS) . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3.3 Neural Network guided MCTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

3 Methods 11
3.1 Molecular Generation conditioned on Synthetic Similarity . . . . . . . . . . . . . . . . . . 11

3.1.1 Connecting AiZynthFinder and REINVENT . . . . . . . . . . . . . . . . . . . . . 11
3.1.2 Route Similarity in REINVENT . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.1.3 Scoring based on Reference Route . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.1.4 Emergent Route Distance Score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12


3.2 Parameter Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.3 Multiple Parameter Optimization - Realistic Application . . . . . . . . . . . . . . . . . . . 13

4 Results 14
4.1 Reference Route Scoring Component . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

4.1.1 Parameter Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
4.1.2 MPO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

4.2 Emergent Route Scoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

5 Discussion 17
5.1 Reference Route Scoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

5.1.1 MPO - Real Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
5.2 Emergent Route Scoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

5.2.1 MPO for the emergent scoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
5.3 Further research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

6 Conclusion 19

Appendix A REINVENT configuration Options 1
A.1 Randomizing SMILES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
A.2 Diversity Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

A.2.1 Penalize Same Smiles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
A.2.2 Identical Murcko Scaffold . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

A.3 Scoring Transforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
A.3.1 Parameterized Sigmoid Scoring Transform . . . . . . . . . . . . . . . . . . . . . . . 2
A.3.2 Truncated Exponential Scoring Transform . . . . . . . . . . . . . . . . . . . . . . . 2
A.3.3 Lenient Scoring Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
A.3.4 Truncated Inverse Scoring Transform . . . . . . . . . . . . . . . . . . . . . . . . . . 2

Appendix B Scoring Components in MPO 2
B.1 Molecular Weight . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
B.2 ROCS Similarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
B.3 Custom Alerts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
B.4 Quantitative Estimate of Drug-Likeness (QED) . . . . . . . . . . . . . . . . . . . . . . . . 3
B.5 Hydrogen Bond Donors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3


De Novo Molecular Generation of Molecules with Consistent Synthetic Strategy

Albin Ekborg

Chalmers University of Technology

June 2024

Abstract
The discovery of new pharmaceutical compounds is essential for developing novel medicines, a process
that is both costly and time-consuming. In this context, generative Artificial Intelligence (AI) methods
offer a valuable, cost-effective approach for generating large amounts of de novo pharmaceutical candi-
dates. However, these drug candidates must still be synthesized for use in in vitro and in vivo studies,
necessitating the creation of efficiently synthesizable drug candidates. In this thesis, we explore two AI
models for drug design developed by AstraZeneca: REINVENT and AiZynthFinder. REINVENT is
a generative AI model designed for creating small molecules fine-tuned with Reinforcement Learning
(RL), whereas AiZynthFinder is a framework for retrosynthesis. We explore the combination of these
models through two novel reward functions and demonstrate that REINVENT, through querying of
AiZynthFinder, can be conditioned to generate molecules that share an identical synthetic route. This
is done in two ways, by provision of a template reaction, or by letting REINVENT independently find
identical routes, allowing the generated molecules to be synthesized in parallel.

1 Introduction

The process of taking a drug from discovery to mar-
ket can take up to 12-15 years, and cost in excess of
$1 billion [1]. During this process, a drug candidate
undergoes heavy scrutiny in regard to its safety and
efficacy. If a drug fails, it is due to it either not be-
ing safe, or not being efficient. As such, being able
to discover pharmaceutical candidates that already
satisfy the requirements of safety and efficacy poses
tremendous value for the pharmaceutical industry.

The process of finding novel pharmaceutical candi-
dates has traditionally relied on rational drug de-
sign, a systematic approach of designing a molecule
to fit a predefined set of properties as well as re-
quirements on safety and efficacy [2, 3]. This re-
quires both conscious effort and detailed knowledge
of a therapeutic target, which is time-consuming
and costly. In this context, the prospect of leverag-
ing generative Artificial Intelligence (AI) methods
becomes valuable as a cost-effective approach for
generating a large number of de novo pharmaceu-
tical candidates [4].

While being able to generate a large number of po-
tential drug candidates is useful in its own right,
drug candidates still need to be synthesized for use
in in vitro and in vivo studies. A common in vitro
method for synthesizing a large number of poten-
tial drug candidates is Parallel Library Synthesis
(PLS) [5]. In PLS, the drug candidates are system-
atically synthesized in parallel under the same con-
ditions, enabling the efficient creation of chemical
libraries containing different potential drug candi-
dates. This process is highly beneficial for drug dis-
covery as it allows for the rapid creation of chem-
ical compounds in parallel. However, it requires
that the molecules share a synthetic pathway, and
therefore, we need design methods that can produce
sets of candidates with a consistent strategy.

Thus, there is a compelling case for a generative
AI model that is capable of performing molecular
design under the constraint of synthetic similarity,
such that the downstream synthesis can be done in
parallel. In this thesis, we present one implemen-
tation of such a method, relying on two AI frame-
works developed by AstraZeneca, REINVENT and


2 De Novo Molecular Generation of Molecules with Consistent Synthetic Strategy

AiZynthFinder [6, 7]. A brief introduction to the
systems will follow.

1.1 REINVENT

REINVENT is the name of a neural-network-based
generative AI framework that performs the design
of small molecules [6, 8]. A REINVENT model
is a Recurrent Neural Network (RNN) trained to
generate molecular sequences known as Simplified
Molecular Input Line Entry System (SMILES) [9].
A REINVENT model (henceforth referred to as an
agent) can be additionally fine-tuned to generate
SMILES pertaining to a certain chemical nature
through the use of Reinforcement Learning (RL),
which has been shown to be a flexible and reliable
method of guiding the agent to specific regions of
chemical space [6, 8], while being able to satisfy
multiple constraints at the same time [10]. This
fine-tuning mechanism is noteworthy, as it can ef-
fectively filter the generation of compounds from
the set of all possible molecules to a subset with de-
sirable properties, as distinguished by certain scor-
ing components.

A scoring component in REINVENT is any func-
tion that acts on a generated molecule and returns
a value between 0 and 1, denoting its desirabil-
ity. It is a very important concept as it is the
primary mechanism through which molecular gen-
eration can be guided. The mechanism is explained
in detail in Section 2.

1.2 AiZynthFinder

AiZynthFinder is the name of an algorithmic syn-
thesis planning software that breaks down a tar-
get molecule into a tree of its constituent reagents
and reactions using different search algorithms, pri-
marily by neural network-guided Monte Carlo Tree
Search (MCTS) [7, 11]. It attempts to find plau-
sible synthetic routes by recursively decomposing
the target molecule into simpler precursors through
a series of potential chemical reactions, ultimately
seeking to reach commercially available building
blocks.

The software utilizes a database of known reac-
tions, such as the United States Patent and Trade-
mark Office (USPTO) repository, to guide its de-
composition strategy. The use of neural networks

allows it to predict the feasibility of chemical trans-
formations, while the MCTS algorithm ensures ef-
ficient exploration of the search space. By priori-
tizing routes based on predicted feasibility, AiZyn-
thFinder is able to propose synthesis pathways that
are not only plausible but also optimal in terms of
cost, time, and resource utilization.

1.3 Aim

Combining REINVENT and AiZynthFinder offers
an interesting pipeline for designing novel drug
candidates, where REINVENT generates molecu-
lar structures that adhere to specific design goals,
and AiZynthFinder is able to plan their synthetic
route.

In this thesis, we will explore a method for generat-
ing compounds constrained by synthetic similarity.
The objective is to determine whether the genera-
tion of molecules can be guided by their synthetic
routes through the combined use of AiZynthFinder
and REINVENT, integrated as a scoring compo-
nent where the desirability of a generated molecule
is determined based on its synthetic route, as sug-
gested by AiZynthFinder. Specifically, the thesis
aims to address the following questions:

• Using REINVENT in conjunction with AiZyn-
thFinder, can we generate molecules that have
a synthetic route identical to an external ref-
erence?

• Is it possible to discover molecules that share a
synthetic route emergently, without providing
a reference?


Theory 3

2 Theory

The following section is divided into three parts,
Chemistry, REINVENT, and AiZynthFinder. The
implementations of these concepts are described in
section 3. It is assumed that the reader is familiar
with basic topics in machine learning.

2.1 Chemistry

This section aims to provide a brief overview of
the chemical theory necessary to understand how
the systems are implemented. More specifically,
this section touches upon molecular representa-
tions, synthesis, and classification.

2.1.1 Representation of Molecular Structure

The structure of a molecule can be represented se-
quentially through the use of Simplified Molecular
Input Line Entry System (SMILES), an ASCII-
compatible molecular representation [9] where
atoms and bonds are represented using common
symbols. It is possible to describe a given molecule
in multiple valid ways, which can cause ambi-
guity [12]. To remedy this, SMILES are often
canonicalized. The canonical SMILES ensure that
the SMILES is a unique identifier for a particular
molecule. Figure 2.1 shows an example of a struc-
tural representation of the molecule caffeine, as well
as its corresponding Canonical SMILES.

O N
N

N

O

N

SMILES: CN1C=NC2=C1C(=O)N(C(=O)N2C)C

Figure 2.1: Molecular structure of the molecule
caffeine, alongside its SMILES representation.

SMILES provide a natural way of working with
molecules in a sequential manner, lending itself to
the application of machine learning models.

2.1.2 Chemical Synthesis

Chemical synthesis is the process of combining
molecules under certain conditions (chemical reac-
tions) to form other molecules (products). It can
be viewed as a tree, wherein multiple reactions
are connected in a sequential or branched man-
ner, where each node represents a molecule or a
chemical reaction, and a chemical reaction connects
two molecular nodes. See Figure 2.2 for a chemi-
cal synthesis tree. In this work, synthetic tree and
synthetic route will be used interchangeably. For

Figure 2.2: A 2-step chemical synthesis tree, showing
the synthesis of the fungal compound muscimol. Here,
chemical reactions are signified by •, whereas the
nodes containing molecular structures are different
compounds partaking in the reactions. The figure was
rendered using the Python package RDKit [13].

the rest of this thesis, a more compact illustration
of synthetic trees will be favored, where molecular
structures are omitted.

2.1.3 Retrosynthesis

Retrosynthesis can intuitively be thought of (but
not explicitly considered) as the inverse opera-
tion of chemical synthesis. Instead of designing
a pathway from reactants to a final product, ret-
rosynthesis works backward from the desired target
molecule, identifying simpler precursor compounds
that can lead to its synthesis. Essentially, it poses
an answer to the question ”How can this molecule
be synthesized?”.

2.1.4 Classification of Molecular Reactions

In this thesis, molecular reactions are classified
according to the reaction classification system
NameRxn [14], which provides a hierarchical ap-
proach to naming chemical reactions. The hierar-
chy consists of three levels, a superclass, class, and


4 De Novo Molecular Generation of Molecules with Consistent Synthetic Strategy

subclass. The hierarchical nature takes chemical
similarity into account, wherein two distinct reac-
tions belonging to the same superclass are inher-
ently more chemically similar than two reactions
belonging to two different superclasses. In Exam-
ple 1 below, the class and name of three chemical
reactions are shown.

Example 1:

2.1.1 Amide Schotten-Baumann Reaction
2.1.8 Thioamide Schotten-Baumann Reaction
3.11.130 Piancatelli rearrangement

It should be apparent that the classification is hi-
erarchical: 2.x.x covers one type of chemical re-
action, with the 2.1.x category specifically cov-
ering ’Schotten-Baumann’ reactions. In contrast,
3.11.130 represents a completely different class of
chemical reactions.

2.1.5 Tree Edit Distance

Tree Edit Distance (TED) is a similarity metric
for graphs, which quantifies the similarity between
a pair of trees by measuring the operational cost
of transforming one tree into the other [15]. The
available operations are insertion, deletion and re-
naming of labels. These operations, performed on
pairwise comparisons of node or edge labels in a
tree, are all that is necessary to reach identity be-
tween two trees.

If two trees are topologically identical but have
different node features, the renaming of said fea-
tures is all that is required to reach identity. Sim-
ilarly, if two trees have an identical root structure
but a different branch, the insertion or deletion of
said branch is all that is required to reach identity.
What TED concerns is the finding of the minimal
cost of the sequence of operations performed be-
tween two trees. By assigning a different cost to
each of these operations, one can tailor the metric
for specific purposes.

This thesis implements TED such that chemical
relevance between chemical reactions in synthesis
trees is accounted for. In order to measure the
similarity between chemical synthesis routes in a
manner that preserves chemical relevance, TED is
weighed such that similar chemical reactions cost

less to rename. In practicality, this means lever-
aging the hierarchical nature of the reaction class
naming system described in section 2.1.4, such that
more similar chemical reactions cost less to re-
name. It is worth noting that the notion of chem-
ical similarity of synthesis routes only considers
reaction-level features of a synthesis tree, and not
the molecules partaking in said reactions. This is
because the primary focus is on the reactions in-
volved, rather than the participating molecules.

To further illustrate how TED calculates distance
on synthesis routes, an example is provided in Ex-
ample 2.

Example 2: Tree Edit Distance on synthesis
routes

Consider two synthetic routes, A and B, illustrated
in Figure 2.3 below. Note that the node features
have been excluded, as they are not taken into ac-
count when TED is calculated.

A

B

Figure 2.3: Two reaction trees A and B. Here, ■
represents a reaction, color denoting a difference in
reaction class, and ◦ represents a molecular node.

We can see that the operational cost of transform-
ing A to B corresponds to one insertion, and one
renaming operation.

2.2 REINVENT

This section describes the original implementation
of REINVENT. It cites closely the original work
[8], and aims to provide the reader with a brief but
complete understanding of how REINVENT works.
In this thesis, the focus lies on the introduction


2.2 REINVENT 5

of two scoring components into REINVENT: the
Reference scoring component, which attempts to
generate molecules that have an identical synthetic
route to a provided reference, and the Emergent
scoring component, which instead produces syn-
thetically consistent molecules emergently. For this
reason, it is important to have a good understand-
ing of how the scoring system works. We begin
by considering the architectural backbone of REIN-
VENT, and then proceed to explain the fine-tuning
mechanism.

2.2.1 Recurrent Neural Networks

A Recurrent Neural Network (RNN) is a type of
neural network architecture that is designed to cap-
ture dependencies between steps in sequential data
[16]. They are characterized by their feedback con-
nections, a connection between a hidden neuron
and itself, see Figure 2.4a. The feedback con-
nections allow the previous neuron state to influ-
ence the interpretation of the current state. RNNs
are trained with Backpropagation Through Time
(BTT), where the feedback is unrolled over sev-
eral time steps. This approach essentially allows
for gradient flow to be traced back through each
time step, enabling the network to learn the associ-
ation between a given state and its relative position
in a sequence [17]. See Figure 2.4b for a represen-
tation of the feedback unrolling.

ht

xt

Ot

whx

woh

whh

(a)

h1

x1

O1

whx

woh

h2

x2

O2

whx

woh

whh
. . . ht

xt

Ot

whx

woh

(b)

Figure 2.4: Schematics of a Recurrent Neural
Network (left) and it’s corresponding unrolled state
(right). Here, w relates the connection weights
between neurons, t being the time step and O, x and h
being the output, input and hidden states respectively.

The presence of feedback connections makes RNNs
particularly suitable for sequence prediction, and
by extension sequence generation, since context
about earlier time steps is propagated through
the network. However, the feedback mechanism
also makes learning long-term dependencies diffi-
cult due to the increased susceptibility to vanishing
or exploding gradients as the sequence length in-
creases [16, 17]. This is often remedied by the intro-
duction of gated neuron mechanisms, which are ca-
pable of regulating the flow of information through
the network by retaining or discarding information
dynamically, through the inclusion of learnable pa-
rameters that can modify the state of a neuron [18,
19]. This enables the network to learn what in-
formation is relevant to keep or discard over long
sequences. One such mechanism, the Gated Recur-
rent Unit (GRU), will be considered in the following
section.

2.2.2 Gated Recurrent Unit (GRU)

The GRU is a type of neuron/cell in an RNN de-
signed to mitigate the risk of vanishing and explod-
ing gradients for long sequences. This is achieved
by replacing the standard neuron with a unit that
has two gates, the reset gate and update gate.
These gates modulate the flow of information in
the feedback connections. They are very similar in
their definition, but their influence on the update
rule is distinct.

The reset gate determines how much of the previous
state should be forgotten. It is defined as

rt = σ(Wr · [ht−1, xt] + br), (1)

with rt being the reset gate output at time t, σ
denoting the sigmoid function, Wr being the learn-
able weight matrix for the reset gate, ht−1 being
the previous hidden state, xt being the input at
time t and br being the bias.

The update gate determines how much of the previ-
ous state should be retained. It is defined similarly
to eq. (1),

zt = σ(Wz · [ht−1, xt] + bz), (2)

with zt being the output of the update gate at time
t and the suffix z denoting a separate set of parame-
ters for the update function for the same quantities
as in eq. (1).


6 De Novo Molecular Generation of Molecules with Consistent Synthetic Strategy

The effect of the gates is realized in two stages.
First, the effect of the reset gate is incorporated by
calculating the new memory content of the current
state:

h̃t = tanh(Wc · [rt ⊙ ht−1, xt] + b), (3)

where h̃t denotes the candidate hidden state, unaf-
fected by the effect of the update gate, and Wc de-
notes yet another weight matrix, exclusive to this
operation. Additionally, ⊙ refers to element-wise
multiplication.

All of this is then combined into the final hidden
state, which is calculated according to

ht = zt ⊙ ht−1 + (1 − zt) ⊙ h̃t. (4)

This ensures that, if the update gate zt is close
to unity, the previous hidden state ht−1 is largely
carried over to the final state ht. Conversely, if zt is
close to 0, the final hidden state is almost entirely
replaced by h̃t

The combined effect of the gating mechanism al-
lows the GRU to retain relevant information in
the network states over longer sequences, reducing
the degradation that happens as the sequences get
longer.

2.2.3 RNN training on Next-Token Prediction

The RNN is trained on the task of next-token pre-
diction by maximum likelihood estimation of the
next token in the target sequence, conditioned on
the previous tokens in the sequence [8]. Here,
a token is a one-hot encoded representation of a
SMILES character. The output of the network at
each step is the probability distribution over the set
of available tokens, and the goal is to maximize the
likelihood of the correct token xt. This is equiv-
alent to minimizing the negative log-likelihood of
the distributions for each step t, and the training
loss becomes

J(θ) = −
T∑

t=1
log P (xt|xt−1, ..., x1) (5)

with J(θ) being the loss function, calculated batch-
wise with respect to the network parameters θ.
The network is then trained with backpropagation

through time, minimizing the loss using Stochastic
Gradient Descent (SGD) with the update rule

θt+1 = θt − α∇J(θt), (6)

with α being the learning rate.

2.2.4 Sequence Generation with RNNs

Generation of sequences is performed in an auto-
regressive manner, wherein the next token in the
sequence is randomly sampled from the output dis-
tribution at each time step. Each generated token
is then iteratively fed back as input to the network
at the next time step. This relies on two tokens
that encode the start of a sequence, GO, and the
end of a sequence EOS. The generation ends once
the EOS token is drawn. Figure 2.5 provides an
overview of token generation for an RNN with one
cell.

h1

GO

x1

h2

x2

. . . ht

EOS

Figure 2.5: A schematic of an RNN performing
autoregressive generation of tokens. The output of a
time step t is fed as input to the next step, until the
termination token EOS has been generated.

2.2.5 Reinforcement Learning

Reinforcement Learning (RL) is a machine learn-
ing optimization technique where an agent learns
from the consequences of its actions a, and is re-
warded for desirable behaviors [20]. Unlike other
learning algorithms, RL does not rely on prede-
fined labels but instead uses a policy π which aims
to determine the appropriate action in response to
a specific state s of the environment. This policy
can be either deterministic, wherein one action is
determined for each state, or probabilistic. For a
probabilistic policy, the policy describes the prob-
ability distribution over actions given a particular


2.2 REINVENT 7

state. Additionally, a reward function R(s, a) as-
sesses the desirability of an action in a certain state.
The goal of the agent is then to learn the policy that
maximizes the expected reward E(Gt) where Gt is
the sum of discounted rewards at timestep t,

Gt =
∞∑

k=0

γkRt+k+1. (7)

Here, γ ∈ [0, 1] is the discount factor, through
which future rewards are devalued. This means
that a reward received at k time steps in the fu-
ture is worth γk−1 times as much as if it was given
immediately [20]. Note that there are two indices
that correspond to time, with t indicating the cur-
rent timestep at which the decision is made, and k
serving as the step offset from t, counting the num-
ber of timesteps into the future for each reward.

We can then form the policy-dependent action-
value function, which represents the expected re-
turn of taking action a in state s, and thereafter
following the policy π. The action-value function is
written as

qπ(s, a) = Eπ[Gt|s, a] =

[ ∞∑
k=0

γkRt+k+1|s, a

]
π

. (8)

The action-value function allows us to evaluate the
effectiveness of each action in a specific state under
the current policy, forming a relationship between
policy and expected reward. We will leverage this
in the following section.

2.2.6 Policy-Gradient Methods

Through parameterization of the policy, policy-
gradient methods aim to directly optimize the pol-
icy π(a|s, θ) through adjustments of the policy pa-
rameters θ [20]. Any parameterization is possible,
as long as the policy is differentiable with respect
to its parameters.

Presented in eq. (9) is the policy-gradient Theorem
for an episodic task with a probabilistic parame-
terized policy π(a|s, θ), with parameters θ. The
theorem states that the gradient of the loss with
respect to the policy parameters is proportional to
the expected return of operating under the policy
[20]. We can describe this more formally as follows.

Let µ(s) denote the on-policy distribution over the
visited states, qπ(a, s) be the expected return of fol-
lowing the policy at state s, and ∇π(a|s, θ) be the
gradient of the policy with respect to the parame-
ters θ. Then, the policy gradient theorem can be
written as

∇J(θ) ∝
∑

s

µ(s)
∑

a

qπ(a, s)∇π(a|s, θ), (9)

where ∇J(θ) is the gradient of the loss with regards
to the policy parameters θ [20].

One specific type of policy-gradient method is the
REINFORCE algorithm [20], which subjects an
agent to learning the optimal policy by explicitly
computing the gradients of the policy with respect
to its performance, evaluated episodically. In this
case, an episode is a task that has a clear beginning
and end and is recorded as a sequence of actions. In
the following section, the REINFORCE algorithm
will be covered briefly.

2.2.7 REINFORCE

The REINFORCE algorithm employs stochastic
gradient descent to update policy parameters. In
this section, we will explore the derivation of the
update rule that enables this process.

We begin by expressing eq. (9) in terms of the ex-
pected value over both actions and states under the
policy,

∇J(θ) ∝ Eπ

[
qπ(At, St)

∇π(At|St, θ)
π(At|St, θ)

]
, (10)

which can be expressed even more compactly by
noting that qπ(At, St) = Eπ[Gt|St, At], i.e. the
return at time t, and by utilizing the identity
∇log x = ∇x

x :

∇J(θ) ∝ Eπ[Gt∇log π(At|St, θ)]. (11)

Equation (11) reveals a quantity that can be sam-
pled at each time step, and which is proportional
to the gradient of the loss. This allows us to form
the REINFORCE update for the policy parameters,
denoted as a step in stochastic gradient descent:

θt+1 ≈ θt − αGt∇log π(At|St, θt), (12)

where α is the learning rate.


8 De Novo Molecular Generation of Molecules with Consistent Synthetic Strategy

This is important, because it allows us to frame the
fine tuning process based entirely on the return of
each sequence. By letting the return of a sequence
denote its desirability, we can condition the genera-
tion to favor compounds which are highly desirable.
The process of formulating such a return is covered
in the following section.

2.2.8 Fine Tuning Molecular Generators

Recall that the output of the network at each time
step is the probability distribution over the possible
tokens given its current state. By letting an action
denote the addition of a token, and the state being
the current sequence of tokens, the network can be
viewed as a policy. This is the policy which we seek
to optimize.

Consider an RNN trained on the task of next-token
prediction of SMILES strings, henceforth referred
to as the prior network. Now consider a separate
network that is an identical copy of the prior net-
work, which we will refer to as the agent network.
Our goal is to have the agent learn to produce
molecules that score highly under a scoring func-
tion S ∈ [0, 1], that is, we want to increase the
likelihood of generating high-scoring sequences un-
der the agent policy. We begin by subjecting the
agent to the decision of what token to add given its
current state.

Let at denote the action of adding a token to a se-
quence being generated, and st be the state at time
t. The sequence of performed actions is denoted A.
Then,

P (A) =
T∏

t=1
π(at|st) (13)

represents the likelihood of the formed sequence A
under the policy π [8]. At this point, we want to up-
date the policy of the agent such that the expected
score of the generated sequences S(A) is increased,
whilst also anchoring the agent to the prior in such
a way that the agent policy doesn’t stray too far
away from the prior. This is because the prior re-
tains good knowledge of SMILES syntax. The an-
choring is achieved in two steps. First, we denote
an augmented likelihood,

logP (A)U = logP (A)prior + σS(A), (14)

which specifies the likelihood of the sequence under
the prior, modulated by the score of the sequence,
where σ is a scalar weight. Then, we denote the re-
turn of a sequence G(A) as the negative squared dif-
ference of the augmented and agent log-likelihoods:

G(A) = −[log(P (A)Agent − logP (A)U]2, (15)
which aims to capture the agreement between the
likelihoods for a given sequence. It is worth noting
that this formulation is an extension to the REIN-
FORCE algorithm made by the original authors of
REINVENT [8].

By substituting the long-term reward in eq. (12)
with G evaluated on an entire sequence, we are
able to frame the fine-tuning problem in terms of
a REINFORCE algorithm update. Thus, by sub-
jecting the agent to minimizing the loss function
J(θ) = −G, we gain two things. First, we gain
the update rule with which we are able to traverse
parameter space (eq. (12)), and secondly, we make
sure that the training process favors highly scoring
sequences whilst also minimizing the discrepancy
between the agent and the prior policy.

2.2.9 Scoring System

The scoring in REINVENT aims to determine the
value of each generated compound, and is vital for
guiding the agent in the desired direction. As this
score is what is used to determine the augmented
likelihood of a sequence (eq. (14)), and by extension
the return (eq. (15)) which we use to calculate our
parameter updates, it naturally becomes an impor-
tant part in guiding the fine-tuning.

A scoring function S(A) is defined to assess the de-
sirability of a generated sequence A. It returns a
numerical value in the range [0, 1], where a higher
score denotes greater desirability. To better explain
the process of scoring, an example of scoring molec-
ular sequences based on their molecular weight is
provided.

Example 3: Scoring based on molecular weight
threshold
Consider a generated molecule described by the se-
quence A with molecular weight WA, and a scoring
function

S(A) =

{
1 if Wmin ≤ WA ≤ Wmax

0 otherwise
(16)


2.3 AiZynthFinder 9

with lower and upper thresholds Wmin, Wmax re-
spectively. It is evident from the definition of S(A)
in Equation (16) that a molecule with molecular
weight outside the interval will get a score of 0.
Together with Equation (14), this results in no up-
dates to the agent policy in the direction of such a
molecule.

Example 3 provides a trivial case, but the intuition
holds true for more complex scoring functions as
well. For example, it is common to use a continuous
scoring function that has a positive slope in the
direction of the highest-scoring region. This will
ensure that as the output moves in the direction of
the desirable region, better and better scores are
returned, continuously rewarding the agent. Such
scoring components are covered in appendix A.

2.2.10 Multiple Parameter Optimization

It is possible to evaluate multiple scoring compo-
nents per generated sequence. The score of sepa-
rate components can then be aggregated into one
by arithmetic or geometric mean. As an example
of such an aggregate, consider two arbitrary but
distinct scoring functions. An agent constrained
by these functions would then be driven to gener-
ate sequences that score highly under both scoring
functions. For the rest of this thesis, the aggrega-
tion of multiple scoring functions will be referred
to as Multiple Parameter Optimization (MPO).

The relationship between scoring components dic-
tates the behavior of the RL. In a simple case, two
scoring components can be synergistic, where com-
pounds that score highly under one component also
score high under the other, narrowing the search.
It can also be destructive, wherein different com-
ponents ”pull” in different directions of chemical
space. As the number of components grow, the in-
teractions between components grow increasingly
difficult to analyze.

2.3 AiZynthFinder

Like section 2.2, this section aims to provide the
reader with a brief understanding of the retrosyn-
thetic framework AiZynthFinder. It is worth not-
ing that, for the purpose of this project, this system
is used as-is without any modifications. As such,

the explanation of the system will be more qualita-
tive in nature, not going into great detail in favor
of brevity.

2.3.1 Computer Aided Synthesis Planning

The goal of retrosynthetic analysis is to suggest
possible synthetic routes for a given chemical com-
pound [21]. In Computer Aided Synthesis Planning
(CASP), this is performed algorithmically in silico
[22, 23].

Since the set of possible routes for a given molecule
is large [24–26], a retrosynthetic algorithm will
search for possible solutions. This search is condi-
tioned on some external factors, such as availabil-
ity of reagents, chemical complexity, or cost [11,
27]. In AiZynthFinder, this search is a Monte Carlo
Tree Search (MCTS) guided by different neural net-
work policies [7, 11].

2.3.2 Monte Carlo Tree Search (MCTS)

MCTS is a policy-based search algorithm designed
to iteratively build a decision tree of optimal ac-
tions [28, 29]. In order to determine the optimal
actions, it estimates the value of different paths in
the tree through Monte-Carlo simulations of poten-
tial outcomes.

In its basic form, it consists of 4 steps,

1. Selection, which starts at the root node and
recursively applies a selection policy to descend
the tree.

2. Expansion, which adds children to a selected
node according to the available actions.

3. Simulation, wherein an outcome for the new
children is simulated based on a simulation pol-
icy. The reward is evaluated once a simulation
reaches a terminal state.

4. Backpropagation, where the simulation result
is propagated upwards from the traversal, up-
dating the statistics for each visited node.

An illustration of the steps in MCTS is shown in
Section 2.3.2 below.

The Selection and Expansion are governed by the
Selection Policy, a function that determines which


10 De Novo Molecular Generation of Molecules with Consistent Synthetic Strategy

1. Selection 2. Expansion 3. Simulation 4. Backpropagation

Figure 2.6: The 4 phases of the Monte Carlo Tree
Search Algorithm.

nodes to grow based on some measure, often the ex-
pected reward [28, 29]. The Simulation is governed
by the Simulation Policy, which aims to estimate
a valuation for a non-terminal state. This is the
Monte-Carlo part of the algorithm, where an eval-
uation of the action is performed by stochastically
traversing the possible states starting from the cur-
rently selected state. This process continues until
a specified recursion depth or a terminal node is
reached [28, 29].

The basic MCTS algorithm is very flexible, and
what AiZynthFinder concerns is an adaptation of
the original MCTS algorithm, which will be cov-
ered in the following section.

2.3.3 Neural Network guided MCTS

AiZynthFinder builds upon the basic MCTS algo-
rithm by incorporating 3 pre-trained Neural Net-
work policies [11], referred to as 3N-MCTS. In
AiZynthFinder, an action represents a chemical re-
action, and a state represents a molecule. The 3
Neural Policies are as follows:

1. Expansion Policy Network, which guides the
tree search to only consider the most promising
actions.

2. In-Scope Filter Network, a binary classifier
trained to predict the chemical feasibility of
a suggested reaction.

3. Rollout Policy Network, which determines how
actions are sampled in the simulation step. Es-
sentially, the Monte-Carlo simulation is done
under the distribution of the Rollout Policy.

This formulation allows for considerable chemical

knowledge to be injected into the decision process,
as well as optimizations with regards to cost or
availability of precursor, wherein multiple networks
can be trained at specific tasks that narrow the
combinatorial complexity of the search while simul-
taneously increasing the search quality.


Methods 11

3 Methods

The following section describes the implementation
of the two scoring components, which is the main
contribution of this thesis. Additionally, it presents
the experimental setup through which the perfor-
mance of the scoring components was evaluated. It
builds upon the knowledge presented in section 2,
and serves as a basis for the results presented in
section 4.

3.1 Molecular Generation conditioned on Syn-
thetic Similarity

In order to condition a REINVENT agent to out-
put molecules that have a similar synthetic route,
molecules that share synthetic routes need to be
scored highly under the agent. This requires two
things - an online retrieval of synthesis routes for
each generated molecule, and a robust notion of
route similarity. In Figure 3.1 on the next page, a
high-level overview of the system can be seen.

The implementation of the separate parts will be
covered in the following sections.

3.1.1 Connecting AiZynthFinder and REINVENT

Both AiZynthFinder and REINVENT are imple-
mented in Python, exist as CLI tools, and can
interact with their environment through different
commands. An AiZynthFinder run starts by be-
ing supplied with a set of SMILES, on which it
performs 3N-MCTS and returns a set of suggested
synthesis routes. A REINVENT run, on the other
hand, generates a batch of SMILES according to
the agent policy. Thus, in order to achieve online
retrieval of synthetic routes for each molecule, we
simply start a run of AiZynthFinder that performs
retrosynthesis on each generated molecule. This is
illustrated further in the following example.

Example 4: An example scoring run.

Consider a run of REINVENT generating a batch
of SMILES. As the batch is completed, the agent
seeks to determine the score for each compound
and update its policy accordingly. At scoring time,
REINVENT starts AiZynthFinder as a subprocess,
feeding to it each generated SMILES, and getting
in return a set of possible synthesis routes for each

SMILES. At this point, each route for each gener-
ated molecule is scored according to its similarity
to a provided reference route. The reference route
is fixed for the entire run. The best scoring tree
(being the one with the lowest distance to refer-
ence) for each molecule is saved, and the score for
said tree is reported as the score of the molecule.
The augmented and the agent likelihoods are cal-
culated, formulating the batch-wise loss, and the
parameters of the agent network are updated ac-
cording to eq. (12).

This procedure iterates until it is terminated.

3.1.2 Route Similarity in REINVENT

Measuring similarity between pairs of synthesis
trees is done using a Python implementation of
TED [15, 30] with a custom configuration for the
operational costs. The operational costs are depen-
dent upon the NameRXN classification system as
described in section 2.1.5, and are presented in Ta-
ble 1.

Table 1: Table of Operational Costs for the TED
operations. Super class references the first position in
the NameRXN classification system, i.e., 2.x.x, class
the second position, and subclass the final position.

Operation Cost

Insertion 1
Deletion 1
Rename superclass 0.8
Rename class 0.5
Rename subclass 0

We see from table 1 that there is a higher cost as-
sociated with larger hierarchical differences in reac-
tion space. Since these differences are more chem-
ically dissimilar, the associated cost aims to drive
the generated molecules to a more constrained area
in chemical space. Additionally, topological differ-
ences between synthesis trees, i.e. the addition or
removal of reactions, are penalized the most.

3.1.3 Scoring based on Reference Route

By providing REINVENT with a reference route
and calculating the TED between it and the route
for each generated compound, the similarity for


12 De Novo Molecular Generation of Molecules with Consistent Synthetic Strategy

AiZynthFinderSCORING

SMILES

Synthesis RoutesScores

Query Distance to Reference

SMILES

REINVENT

Figure 3.1: A High-Level Overview of the connection between REINVENT and AiZynthFinder through a
scoring component. Here, the scoring component passes the SMILES to AiZynthFinder which returns
retrosynthetic suggestions for each molecule. The Scoring component then calculates the distance to reference
via TED, and returns the corresponding score to REINVENT. In this view, the scoring component is modeled
as a separate entity, even though it is technically a part of REINVENT.

each compound can be quantified. In this case, the
best scoring route for each molecule is the route
that has the lowest TED to a provided reference
route. Thus, the provided reference route serves as
a template for which reactions are considered good,
guiding the generation of molecules in the direction
of the reference. This allows for the exploration of
compounds when a reaction route is known.

In this thesis, two references were considered, of
which an abstraction is shown in fig. 3.2. The
two routes aimed to occupy ”complex” and ”sim-
ple” chemistry, exploring the efficacy of the method
when subjected to two different levels of synthetic
complexity. They were selected manually, and cat-
egorized as complex or simple based on the num-
ber of reactions and prevalence of said reactions.
In essence, the complex route had more synthesis
steps as well as more uncommon reactions than the
simple route.

3.1.4 Emergent Route Distance Score

In the emergent scoring case, the focus shifts from
minimizing the TED to a reference to instead max-
imizing the size of a singular group of reaction
routes. Groups are based on the NameRXN clas-
sification system and are created for each batch of
molecules. The process is outlined as follows.

For each generated compound, REINVENT re-
trieves a corresponding set of reaction routes. Each
route is then assigned a shorthand identifier de-
noted as the chain of reactions represented by the
NameRXN system. The frequency of each route in

A

B

Figure 3.2: Simple synthetic route (A) and complex
synthetic route (B). The simple route represents a 2
step synthesis with common reactions, and the
complex route is a 4 step synthesis with different and
more uncommon reactions.

the batch is then recorded in a dictionary, forming
groups where the size of each group indicates its
prevalence. Then, for each compound, the route
that falls into the largest possible group is selected.
The scoring component also implements a filter
that cumulatively sums the size of each group for
each epoch. If the number of molecules exceed the
filter threshold, the score is set to 0. The score for
each compound is then calculated as

S(A) =


(

G(A)−min(G)
max(G)

)2
if G(A) < F

0 otherwise
, (17)

where G is the dictionary of groups, A is a SMILES
sequence, F is the filter, and G(A) denotes the size
of the largest group that is occupied by a route that
can synthesize A.

Thus, generated molecules that can be synthesized


3.2 Parameter Evaluation 13

by the largest group is favoured. Conversely, small
groups will be scored low. Note that the score is
evaluated batch-wise, which means that the group
sizes, and thus scores, are dynamic. This further
encourages the collapse into one large group, as the
average score per batch will increase when the num-
ber of groups is low.

3.2 Parameter Evaluation

REINVENT has many different hyperparameters
that can change the outcome of the learning. Thus,
in order to properly evaluate the effect of the in-
troduced scoring component, the effect of different
configurations needs to be explored. Additionally,
this provides insight into how an optimal configu-
ration might look.

Due to computational constraints, the influence of
parameters was evaluated using a one-parameter-
at-a-time approach, wherein parameters were var-
ied from a baseline configuration. In order to gain
confidence of a noted influence, three replicates
were made. The base configuration consisted of
parameters at levels that had shown success when
applied to adjacent tasks. In table 2 below, the
base configuration can be seen.

Table 2: Table showing the configuration for the
baseline route distance scoring run. Here, σ refers to
the scaling parameter in eq. (14).

Parameter Value
Batch Size 64
Learning Rate 0.001
σ 128
Randomize Smiles* True
Diversity Filter* Identical Murcko Scaffold
Scoring Transform* Reverse Sigmoid

*Covered in Appendix

Some parameters, such as batch size and learning
rate, are self-explanatory and will not be explained
further. The other parameters, however, will be ex-
plained in appendix A. The parameters were var-
ied based on this configuration and run for three
replicates on the simple reference route. The best-
performing configuration is then transplanted to an
MPO run for both the simple and complex route,
as described below.

3.3 Multiple Parameter Optimization - Realistic
Application

By subjecting REINVENT to multiple scoring com-
ponents simultaneously, REINVENT will attempt
to find the subspace of the output that contains
high-scoring molecules for all components. This
allows for the exploration of the impact that the
scoring component has in a realistic application,
wherein a user aims to satisfy multiple properties
at the same time. A generated molecule is consid-
ered a ”hit” if it scores above the supplied threshold
values for all of the component scores.

In this thesis, MPO is used for the purpose of ex-
ploring the performance of the route distance com-
ponent when it is joined by multiple scoring com-
ponents, aimed at emulating a supposed realistic
application of REINVENT. It does this by attempt-
ing to replicate a previous study [10], with the
added constraint of route similiarity. The scoring
components and their settings are described in ap-
pendix B, alongside a table showcasing the thresh-
old values.


14 De Novo Molecular Generation of Molecules with Consistent Synthetic Strategy

4 Results

In this section, we present the results for the two
scoring components. The results for the reference
scoring component are shown as a parameter evalu-
ation and an MPO run comparing the performance
to a baseline. For the emergent scoring component,
only the performance for one MPO run is shown.

Here, performance is measured as the cumulative
number of hits for an MPO run, where hits refer
to generated molecules that score sufficiently high
under all active components. The components for
this MPO experiment were chosen to replicate a
previous study [10], with the added constraint of
route similarity.

4.1 Reference Route Scoring Component

The results for the reference route scoring compo-
nent is presented in two parts. First, as an assese-
ment of the parameter evaluation aimed at discov-
ering a good configuration of the component, and
then as the performance of an MPO run compared
to a baseline.

4.1.1 Parameter Search

The experiment was set up according to the de-
scription in section 3.2, where each parameter was
varied from a base configuration and ran for three
replicates. The average performance of each evalu-
ation can be seen in Figure 4.1 on the next page.

Looking at Figure 4.1, it is clear that the expo-
nential scoring transform vastly outperforms the
other configurations, showing that the majority of
the generated compounds have 0 distance to the
provided reference at the end of the run. On the
contrary, the inverse and lenient scoring transforms
produce the lowest number of compounds with 0
distance to the reference for the investigated time-
frame. We can also see that increasing the learning
rate two-fold resulted in instability, as noted by the
decrease in valid routes after around 200 epochs,
whereas the halved learning rate slowly increased
the mid-low distance compounds but not converg-
ing.

4.1.2 MPO

The multiple parameter optimization experiment
was carried out for the simple and complex route,
and compared to a baseline MPO without route
distance scoring. As described in section 3.3, a
molecule is considered a hit if it exceeds the given
score threshold for each component. For the base-
line, it means that a hit share route with the refer-
ence by pure chance. In Figure 4.2, the number of
cumulative hits between a baseline and a run scored
with the route distance component is compared.

0 200 400 600 800 1000
Epoch

0

10

20

30

40

50

60

H
its

Cumulative Sum over Number of Hits - Average of 5

Route Distance
Baseline

Figure 4.2: Multi Parameter Optimization for the
simple route, compared between a baseline and a route
distance scoring run. The standard deviation is shown
as the translucent spread, calculated over 5 replicates.

From Figure 4.2 we see that, while both methods
are capable of generating hits with a consistent syn-
thetic strategy, the route distance scoring compo-
nent outperforms the baseline by having a more
productive trajectory, generating more hits per run
on average.

Using the same experimental setup for the com-
plex route, we found that no hits were produced
for either the baseline or the route distance scor-
ing component during any of the three attempted
replicates. This indicates a failure to find a chemi-
cal subspace that satisfies all the scores.

4.2 Emergent Route Scoring

Emergent scoring is able to discover groups with
molecules whose chemical synthesis routes are iden-
tical, without requiring an explicit template. In
Figure 4.3, an MPO scoring run with the emergent
component is shown.


4.2 Emergent Route Scoring 15

TED = 0
0.5 <= TED < 0.8
0.8 <= TED < 1.3
1.3 <= TED

0 200 400 600 800 1000
Epoch

0

20

40

60

R
ou

te
s

Baseline

0 200 400 600 800 1000
Epoch

0

20

40

60

R
ou

te
s

Penalize Same Smiles

0 200 400 600 800 1000
Epoch

0

20

40

R
ou

te
s

Learning Rate 0.0005

0 200 400 600 800 1000
Epoch

0

20

40

60

R
ou

te
s

Learning Rate 0.002

0 200 400 600 800 1000
Epoch

0

20

40

60

R
ou

te
s

Learning Sigma 256

0 200 400 600 800 1000
Epoch

0

20

40

60

R
ou

te
s

Learning Sigma 512

0 200 400 600 800 1000
Epoch

0

25

50

75

100

R
ou

te
s

Batch Size 128

0 200 400 600 800 1000
Epoch

0

20

40

60
R
ou

te
s

Randomize SMILES Off

0 200 400 600 800 1000
Epoch

0

20

40

60

R
ou

te
s

Scoring transform (Exp)

0 200 400 600 800 1000
Epoch

0

20

40

60

R
ou

te
s

Scoring transform (Inv)

0 200 400 600 800 1000
Epoch

0

20

40

60

R
ou

te
s

Scoring transform (Lenient)

Parameter Evaluation of Reference Route Distance Component, Average of 3 Replicates

Figure 4.1: Performance evaluation shown as the distance to reference per route, plotted per epoch. Each
subplot is titled by the parameter being evaluated. A high fraction of low-distance compounds means better
performance.

The upper figure shows the cumulative sum over
output molecules grouped by reaction route for
each epoch, showing how REINVENT is capable
of finding a lucrative route and exploiting it until
the filter mechanism is activated and the score is
reduced to 0. We note that the cumulative sum
keeps increasing for some routes even after the fil-

ter is activated, a result of the filter mechanism act-
ing on the returned scores rather than blocking the
generation entirely. From the point of filter activa-
tion and onward, there is no reward for generating
molecules that belong to that group, however, it is
still possible to do so. We see that this behaviour is
present throughout the run, where the agent enters


16 De Novo Molecular Generation of Molecules with Consistent Synthetic Strategy

0 200 400 600 800 1000
Epoch

0

1000

2000

3000

C
um

ul
at
iv
e
Su

m

Cumulative Sum of Routes over Epoch
Routes

2.1, 2.1
1.6, 2.1
1.2
2.1, 1.3
2.1, 9.3
1.3
2.1
1.6
2.1, 1.6
2.1, 1.2

0 200 400 600 800 1000
Epoch

0

50

100

150

200

C
um

ul
at
iv
e
N
um

be
r
of

H
its

Cumulative Sum of Hits by Route over Epoch
Routes

2.1, 1.2
2.1, 1.6
2.1, 1.3

Figure 4.3: Upper: Cumulative sum over routes for each epoch. Lower: The cumulative hits produced
during the MPO run categorized by route. In the upper plot, the red line corresponds to the filter activation
threshold. Additionally, routes with a cumulative sum below 1500 and 25 have been omitted for the upper and
lower plots respectively.

a productive phase for one particular route until
the filter is activated, rapidly proceeding to exploit
another route.

Looking at the lower part of Figure 4.3, we see the
cumulative number of hits. Here, a hit is based
on the same criterion as in section 4.1.2, i.e. a
molecules has to score sufficiently high score for
all present components. We can note that there is
a clear overlap between the exploited route in the
upper figure and the productive route in the lower
figure. However, we also note that the productive
phases doesn’t begin until around epoch 700. Up
until that point, the agent has not yet learned to
reliably produce molecules that perform well under
the other scoring components in the MPO. Once

the agent has learned sufficiently well, we see that
the production of hits coincide with the exploita-
tion of a certain route, with short lag times between
filter activation and exploitation.


Discussion 17

5 Discussion

This thesis covered the implementation of two dif-
ferent scoring components for generating molecules
conditioned on synthetic similarity. In this section,
we discuss the results presented in section 4.

5.1 Reference Route Scoring

The reference route scoring component is, through
the explicit measure of the TED between all gen-
erated compounds and the provided reference, able
to guide the generation towards a region in chemi-
cal space that contains molecules with a consistent
synthetic strategy. However, this relies upon the
user providing a feasible reference route. Thus, the
use of this scoring component is aimed at the cases
of drug discovery where a synthetic route is already
available.

Additionally, the reference routes used in this the-
sis were curated manually, with the reasoning pre-
sented in section 3.1.3. By performing manual se-
lection of routes based on a manufactured notion
of complexity considerable bias might have been
introduced to the system. It is possible that the
selected complex route isn’t synthetically complex,
but just synthetically improbable. Here, the role
of AiZynthFinder can not be understated, as it
aims to minimize the cost of the routes it suggests.
This means that shorter routes are inherently fa-
vored, leading to a smaller chance of finding iden-
tical routes when the reference is long - a valuable
insight for the further development of the compo-
nent, as it suggests a limitation of the combined
system. Overall, a better survey on reaction chem-
istry could have allowed for more robust assump-
tions, leading to a better evaluation of the compo-
nents performance.

Furthermore, the parameter evaluation in sec-
tion 4.1.1 revealed that the scoring component was
very sensitive to which scoring transform was used.
This is particularly evident when looking at the
exceptional performance of the exponential scor-
ing transform. It is interesting to note that the
exponential scoring function was something that
had not been previously implemented in REIN-
VENT. Conversely, the inverse and the lenient
scoring transforms performed surprisingly poorly.
It is thought that this is due to these scoring func-

tions exhibiting a worse payoff - the difference in
return for a sequence that is marginally closer to
the reference is not high enough to guide the net-
work. The higher the payoff, the stronger the pull.

5.1.1 MPO - Real Application

The effect of aggregating multiple scoring compo-
nents is illustrated in the MPO runs, where multi-
ple scoring components interact to make the search
space difficult to traverse. However, both the base-
line and the scoring component for the simple route
manage to generate components that have zero dis-
tance to the simple route. The most interesting
point of discussion here is the fact that the baseline
generated a surprising amount of compounds with
zero distance to reference, even though it was never
subjected to information about the synthesis routes
of the molecules it generated. This could possibly
be attributed to the flexible chemistry of the simple
reference route, which consists of common reactions
that potentially cover the synthesis of many differ-
ent compounds, emerging spontaneously as a good
candidate route from AiZynthFinder. This is fur-
ther strengthened by the high spread seen in the
baseline - some runs will, by chance, end up pro-
ducing molecules that can be synthesized with the
supplied reference route. However, since there is no
constraint, other runs will produce molecules that
is synthesized by different synthetic routes.

5.2 Emergent Route Scoring

The emergent route scoring allows for the discov-
ery of synthetically parallelizable compounds with-
out the need for any prior knowledge about a com-
pounds synthesis. By foregoing the need to evalu-
ate similarity between routes, instead taking only
equality into account, the method becomes faster
and sharper. In the upper part of Figure 4.3, we
see how the number of molecules evolve with epoch,
showcasing the flexibility and speed of the method
- as each filter is applied, the agent rapidly finds a
new route to exploit.

The emergent scoring, as it has been implemented
for this thesis, also removes the need to use a scor-
ing transform. This reduces the complexity of the
parameter search and removes one moving part
from this otherwise very complex system.


18 De Novo Molecular Generation of Molecules with Consistent Synthetic Strategy

5.2.1 MPO for the emergent scoring

Naturally, the emergent scoring manages to gener-
ate more hits than the reference scoring component
- removing the restriction of a reference route allows
the agent to explore more freely. Now, many solu-
tions exist, whereas previously only one would be
accepted. What is interesting is that the emergent
scoring component seems to retain the ability to
generate high-scoring compounds under the rest of
the components, even when the filter activates and
drops a route score to 0, which can be seen in the
later stages of the lower part of Figure 4.3. As the
filter is activated, a new productive route is found
almost immediately. This is likely a consequence of
evaluating the scoring component dynamically and
batch-wise, where the formation of a large majority
group is strongly encouraged - resulting in a rapid
collapse into one productive group.

5.3 Further research

A topic for further research would be to make a
qualitative analysis of how reference route com-
plexity relates to performance, where the notion
of complexity is robustly determined. As has been
shown in this work, the chosen complex route did
not yield any hits for the MPO runs, which could
be due to either a destructive relationship between
the other components and the complex reference,
or due to the fact that AiZynthFinder guides its
search towards shorter routes. The correlation be-
tween reference route complexity and generative
performance should thus be investigated further,
in order to learn more about the limits of this com-
ponent.

Another topic is optimizing the operational weights
for TED calculations. The weights used to deter-
mine the operational costs for TED, as presented in
Table 1, is an area that was not explored in great
detail during this thesis, even though it undoubt-
edly has a large impact on how the scores are calcu-
lated. For example, the lowest possible distance to
a compound is 0.5, corresponding to a difference in
the ”class” level of a named reaction. This will then
be transformed to different values depending on the
scoring transform. Thus, since the scoring trans-
forms showed great impact on the performance of
the component, it is safe to assume that the oper-
ational weights of TED would have at least some

impact as well. It is thus of great interest to know if
additional performance can be gained by adjusting
these weights.

Finally, while the emergent route distance compo-
nent shows great initial promise, it has not been
assessed robustly. Minor alterations to configura-
tions have been attempted based on intuition, as
opposed to the systematic evaluation seen in Fig-
ure 4.1 for the reference route component. Perhaps
there are further performance gains to be discov-
ered.


Conclusion 19

6 Conclusion

In this thesis we have demonstrated a novel method
for generating molecules with a consistent synthetic
route. By connecting REINVENT with AiZyn-
thFinder, two distinct scoring components were
implemented - one which is capable of generating
molecules whose synthetic route adhere to an ex-
ternal reference, and one that emergently discovers
molecules with identical synthetic pathway. This
provides the answers to the questions that this the-
sis sought to adress. Furthermore, we are able to
show that the scoring components can be used in
junction with other components, allowing for the
exploration of molecules that fit a multitude of de-
sign criteria, whilst now also being synthesizable in
parallel. It is our hope that this will facilitate the
development of new pharmaceutical treatments by
enhancing the process of drug discovery.


20 De Novo Molecular Generation of Molecules with Consistent Synthetic Strategy

References

[1] J. Hughes, S. Rees, S. Kalindjian, and K.
Philpott, “Principles of early drug discovery,”
British Journal of Pharmacology, vol. 162, no. 6,
pp. 1239–1249, 2011, issn: 1476-5381. doi: 10.
1111 / j . 1476 - 5381 . 2010 . 01127 . x. [Online].
Available: https://onlinelibrary.wiley.com/
doi/abs/10.1111/j.1476-5381.2010.01127.x
(visited on 04/23/2024).

[2] M. K. Mahapatra and M. Karuppasamy, “Funda-
mental considerations in drug design,” Computer
Aided Drug Design (CADD): From Ligand-Based
Methods to Structure-Based Approaches, pp. 17–
55, 2022. doi: 10.1016/B978- 0- 323- 90608-
1.00005- 8. [Online]. Available: https://www.
ncbi.nlm.nih.gov/pmc/articles/PMC9212230/
(visited on 05/09/2024).

[3] S. Mandal, M. Moudgil, and S. K. Mandal, “Ra-
tional drug design,” eng, European Journal of
Pharmacology, vol. 625, no. 1-3, pp. 90–100, Dec.
2009, issn: 1879-0712. doi: 10.1016/j.ejphar.
2009.06.065.

[4] H. Chen, O. Engkvist, Y. Wang, M. Olivecrona,
and T. Blaschke, “The rise of deep learning in
drug discovery,” Drug Discovery Today, vol. 23,
no. 6, pp. 1241–1250, Jun. 2018, issn: 1359-6446.
doi: 10.1016/j.drudis.2018.01.039. [Online].
Available: https://www.sciencedirect.com/
science/article/pii/S1359644617303598 (vis-
ited on 04/15/2024).

[5] A. W. Dombrowski, A. L. Aguirre, A. Shrestha,
K. A. Sarris, and Y. Wang, “The Chosen
Few: Parallel Library Reaction Methodologies for
Drug Discovery,” eng, The Journal of Organic
Chemistry, vol. 87, no. 4, pp. 1880–1897, Feb.
2022, issn: 1520-6904. doi: 10.1021/acs.joc.
1c01427.

[6] H. H. Loeffler et al., “Reinvent 4: Modern AI–
driven generative molecule design,” Journal of
Cheminformatics, vol. 16, no. 1, p. 20, Feb. 2024,
issn: 1758-2946. doi: 10 . 1186 / s13321 - 024 -
00812 - 5. [Online]. Available: https : / / doi .
org / 10 . 1186 / s13321 - 024 - 00812 - 5 (visited
on 03/18/2024).

[7] S. Genheden, A. Thakkar, V. Chadimová, J.-L.
Reymond, O. Engkvist, and E. Bjerrum, “AiZyn-
thFinder: A fast, robust and flexible open-source
software for retrosynthetic planning,” Journal of
Cheminformatics, vol. 12, no. 1, p. 70, Nov. 2020,
issn: 1758-2946. doi: 10 . 1186 / s13321 - 020 -
00472 - 1. [Online]. Available: https : / / doi .
org / 10 . 1186 / s13321 - 020 - 00472 - 1 (visited
on 03/18/2024).

[8] M. Olivecrona, T. Blaschke, O. Engkvist, and H.
Chen, “Molecular de-novo design through deep
reinforcement learning,” Journal of Cheminfor-
matics, vol. 9, no. 1, p. 48, Sep. 2017, issn: 1758-
2946. doi: 10.1186/s13321-017-0235-x. [On-
line]. Available: https : / / doi . org / 10 . 1186 /
s13321-017-0235-x (visited on 04/18/2024).

[9] D. Weininger, “SMILES, a chemical language
and information system. 1. Introduction to
methodology and encoding rules,” Journal of
Chemical Information and Computer Sciences,
vol. 28, no. 1, pp. 31–36, Feb. 1988, Publisher:
American Chemical Society, issn: 0095-2338.
doi: 10.1021/ci00057a005. [Online]. Available:
https://doi.org/10.1021/ci00057a005 (vis-
ited on 04/15/2024).

[10] M. Dodds, J. Guo, T. Löhr, A. Tibo, O. Engkvist,
and J. Paul Janet, “Sample efficient reinforce-
ment learning with active learning for molecular
design,” en, Chemical Science, vol. 15, no. 11,
pp. 4146–4160, 2024. doi: 10.1039/D3SC04653B.
[Online]. Available: https://pubs.rsc.org/en/
content/articlelanding/2024/sc/d3sc04653b
(visited on 05/09/2024).

[11] M. H. S. Segler, M. Preuss, and M. P. Waller,
“Planning chemical syntheses with deep neural
networks and symbolic AI,” en, Nature, vol. 555,
no. 7698, pp. 604–610, Mar. 2018, Publisher: Na-
ture Publishing Group, issn: 1476-4687. doi: 10.
1038/nature25978. [Online]. Available: https:
//www.nature.com/articles/nature25978 (vis-
ited on 05/07/2024).

[12] D. Weininger, A. Weininger, and J. L. Weininger,
“SMILES. 2. Algorithm for generation of unique
SMILES notation,” Journal of Chemical Infor-
mation and Computer Sciences, vol. 29, no. 2,
pp. 97–101, May 1989, Publisher: American
Chemical Society, issn: 0095-2338. doi: 10 .
1021/ci00062a008. [Online]. Available: https:
//doi.org/10.1021/ci00062a008 (visited on
04/18/2024).

[13] RDKit. [Online]. Available: https://www.rdkit.
org/ (visited on 06/10/2024).

[14] I. Lagersted, J. Mayfiel, and R. Sayl, NameRx
More than just a Reaction Classifier, en, 2021.

[15] M. Pawlik and N. Augsten, “Tree edit distance:
Robust and memory-efficient,” Information Sys-
tems, vol. 56, pp. 157–173, Mar. 2016, issn: 0306-
4379. doi: 10.1016/j.is.2015.08.004. [Online].
Available: https://www.sciencedirect.com/
science/article/pii/S0306437915001611 (vis-
ited on 03/18/2024).

[16] B. Mehlig, Machine Learning with Neural Net-
works: An Introduction for Scientists and Engi-

https://doi.org/10.1111/j.1476-5381.2010.01127.x
https://doi.org/10.1111/j.1476-5381.2010.01127.x
https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1476-5381.2010.01127.x
https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1476-5381.2010.01127.x
https://doi.org/10.1016/B978-0-323-90608-1.00005-8
https://doi.org/10.1016/B978-0-323-90608-1.00005-8
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9212230/
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9212230/
https://doi.org/10.1016/j.ejphar.2009.06.065
https://doi.org/10.1016/j.ejphar.2009.06.065
https://doi.org/10.1016/j.drudis.2018.01.039
https://www.sciencedirect.com/science/article/pii/S1359644617303598
https://www.sciencedirect.com/science/article/pii/S1359644617303598
https://doi.org/10.1021/acs.joc.1c01427
https://doi.org/10.1021/acs.joc.1c01427
https://doi.org/10.1186/s13321-024-00812-5
https://doi.org/10.1186/s13321-024-00812-5
https://doi.org/10.1186/s13321-024-00812-5
https://doi.org/10.1186/s13321-024-00812-5
https://doi.org/10.1186/s13321-020-00472-1
https://doi.org/10.1186/s13321-020-00472-1
https://doi.org/10.1186/s13321-020-00472-1
https://doi.org/10.1186/s13321-020-00472-1
https://doi.org/10.1186/s13321-017-0235-x
https://doi.org/10.1186/s13321-017-0235-x
https://doi.org/10.1186/s13321-017-0235-x
https://doi.org/10.1021/ci00057a005
https://doi.org/10.1021/ci00057a005
https://doi.org/10.1039/D3SC04653B
https://pubs.rsc.org/en/content/articlelanding/2024/sc/d3sc04653b
https://pubs.rsc.org/en/content/articlelanding/2024/sc/d3sc04653b
https://doi.org/10.1038/nature25978
https://doi.org/10.1038/nature25978
https://www.nature.com/articles/nature25978
https://www.nature.com/articles/nature25978
https://doi.org/10.1021/ci00062a008
https://doi.org/10.1021/ci00062a008
https://doi.org/10.1021/ci00062a008
https://doi.org/10.1021/ci00062a008
https://www.rdkit.org/
https://www.rdkit.org/
https://doi.org/10.1016/j.is.2015.08.004
https://www.sciencedirect.com/science/article/pii/S0306437915001611
https://www.sciencedirect.com/science/article/pii/S0306437915001611


REFERENCES 21

neers. Cambridge: Cambridge University Press,
2021, isbn: 978-1-108-49493-9. doi: 10 . 1017 /
9781108860604. [Online]. Available: https :
/ / www . cambridge . org / core / books /
machine - learning - with - neural - networks /
0028D1883AF842C81340508119AB6491 (visited on
05/07/2024).

[17] A. Sherstinsky, “Fundamentals of Recurrent
Neural Network (RNN) and Long Short-Term
Memory (LSTM) Network,” Physica D: Nonlin-
ear Phenomena, vol. 404, p. 132 306, Mar. 2020,
arXiv:1808.03314 [cs, stat], issn: 01672789. doi:
10.1016/j.physd.2019.132306. [Online]. Avail-
able: http://arxiv.org/abs/1808.03314 (vis-
ited on 04/23/2024).

[18] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio,
Empirical Evaluation of Gated Recurrent Neural
Networks on Sequence Modeling, arXiv:1412.3555
[cs], Dec. 2014. [Online]. Available: http : / /
arxiv . org / abs / 1412 . 3555 (visited on
04/19/2024).

[19] S. Hochreiter and J. Schmidhuber, “Long Short-
term Memory,” Neural computation, vol. 9,
pp. 1735–80, Dec. 1997. doi: 10 . 1162 / neco .
1997.9.8.1735.

[20] R. Sutton and A. Barto, “Reinforcement Learn-
ing: An Introduction,” IEEE Transactions on
Neural Networks, vol. 9, no. 5, pp. 1054–1054,
Sep. 1998, Conference Name: IEEE Transactions
on Neural Networks, issn: 1941-0093. doi: 10.
1109 / TNN . 1998 . 712192. [Online]. Available:
https : / / ieeexplore . ieee . org / document /
712192 (visited on 04/18/2024).

[21] Z. Wang, W. Zhao, G. Hao, and B. Song, “Map-
ping the resources and approaches facilitating
computer-aided synthesis planning,” en, Organic
Chemistry Frontiers, vol. 8, no. 4, pp. 812–824,
Feb. 2021, Publisher: The Royal Society of Chem-
istry, issn: 2052-4129. doi: 10.1039/D0QO00946F.
[Online]. Available: https://pubs.rsc.org/en/
content/articlelanding/2021/qo/d0qo00946f
(visited on 05/07/2024).

[22] E. J. Corey and W. T. Wipke, “Computer-
Assisted Design of Complex Organic Syntheses,”
Science, vol. 166, no. 3902, pp. 178–192, Oct.
1969, Publisher: American Association for the
Advancement of Science. doi: 10.1126/science.
166 . 3902 . 178. [Online]. Available: https : / /
www.science.org/doi/10.1126/science.166.
3902.178 (visited on 05/07/2024).

[23] W.-D. Ihlenfeldt and J. Gasteiger, “Computer-
Assisted Planning of Organic Syntheses: The
Second Generation of Programs,” en, Ange-
wandte Chemie International Edition in English,

vol. 34, no. 23-24, pp. 2613–2633, 1996, issn:
1521-3773. doi: 10.1002/anie.199526131. [On-
line]. Available: https://onlinelibrary.wiley.
com/doi/abs/10.1002/anie.199526131 (visited
on 05/07/2024).

[24] S. Hong, H. H. Zhuo, K. Jin, G. Shao, and Z.
Zhou, “Retrosynthetic planning with experience-
guided Monte Carlo tree search,” en, Communi-
cations Chemistry, vol. 6, no. 1, pp. 1–14, Jun.
2023, Publisher: Nature Publishing Group, issn:
2399-3669. doi: 10.1038/s42004- 023- 00911-
8. [Online]. Available: https : / / www . nature .
com / articles / s42004 - 023 - 00911 - 8 (visited
on 05/07/2024).

[25] C. W. Coley, L. Rogers, W. H. Green, and
K. F. Jensen, “Computer-Assisted Retrosynthe-
sis Based on Molecular Similarity,” ACS Cen-
tral Science, vol. 3, no. 12, pp. 1237–1245, Dec.
2017, Publisher: American Chemical Society,
issn: 2374-7943. doi: 10 . 1021 / acscentsci .
7b00355. [Online]. Available: https : / / doi .
org/10.1021/acscentsci.7b00355 (visited on
05/07/2024).

[26] H. Dai, C. Li, C. W. Coley, B. Dai, and L.
Song, Retrosynthesis Prediction with Conditional
Graph Logic Network, arXiv:2001.01408 [cs, stat],
Jan. 2020. doi: 10.48550/arXiv.2001.01408.
[Online]. Available: http : / / arxiv . org / abs /
2001.01408 (visited on 05/07/2024).

[27] P. Ertl and A. Schuffenhauer, “Estimation of syn-
thetic accessibility score of drug-like molecules
based on molecular complexity and fragment
contributions,” Journal of Cheminformatics,
vol. 1, no. 1, p. 8, Jun. 2009, issn: 1758-2946.
doi: 10.1186/1758-2946-1-8. [Online]. Avail-
able: https://doi.org/10.1186/1758-2946-1-
8 (visited on 05/07/2024).

[28] C. Browne et al., “A Survey of Monte Carlo Tree
Search Methods,” IEEE Transactions on Compu-
tational Intelligence and AI in Games, vol. 4:1,
pp. 1–43, Mar. 2012. doi: 10 . 1109 / TCIAIG .
2012.2186810.

[29] M. Świechowski, K. Godlewski, B. Sawicki, and
J. Mańdziuk, “Monte Carlo Tree Search: A re-
view of recent modifications and applications,”
en, Artificial Intelligence Review, vol. 56, no. 3,
pp. 2497–2562, Mar. 2023, issn: 1573-7462. doi:
10.1007/s10462-022-10228-y. [Online]. Avail-
able: https://doi.org/10.1007/s10462-022-
10228-y (visited on 05/07/2024).

[30] M. Pawlik and N. Augsten, “Efficient Computa-
tion of the Tree Edit Distance,” ACM Transac-
tions on Database Systems, vol. 40, no. 1, 3:1–
3:40, Mar. 2015, issn: 0362-5915. doi: 10.1145/

https://doi.org/10.1017/9781108860604
https://doi.org/10.1017/9781108860604
https://www.cambridge.org/core/books/machine-learning-with-neural-networks/0028D1883AF842C81340508119AB6491
https://www.cambridge.org/core/books/machine-learning-with-neural-networks/0028D1883AF842C81340508119AB6491
https://www.cambridge.org/core/books/machine-learning-with-neural-networks/0028D1883AF842C81340508119AB6491
https://www.cambridge.org/core/books/machine-learning-with-neural-networks/0028D1883AF842C81340508119AB6491
https://doi.org/10.1016/j.physd.2019.132306
http://arxiv.org/abs/1808.03314
http://arxiv.org/abs/1412.3555
http://arxiv.org/abs/1412.3555
https://doi.org/10.1162/neco.1997.9.8.1735
https://doi.org/10.1162/neco.1997.9.8.1735
https://doi.org/10.1109/TNN.1998.712192
https://doi.org/10.1109/TNN.1998.712192
https://ieeexplore.ieee.org/document/712192
https://ieeexplore.ieee.org/document/712192
https://doi.org/10.1039/D0QO00946F
https://pubs.rsc.org/en/content/articlelanding/2021/qo/d0qo00946f
https://pubs.rsc.org/en/content/articlelanding/2021/qo/d0qo00946f
https://doi.org/10.1126/science.166.3902.178
https://doi.org/10.1126/science.166.3902.178
https://www.science.org/doi/10.1126/science.166.3902.178
https://www.science.org/doi/10.1126/science.166.3902.178
https://www.science.org/doi/10.1126/science.166.3902.178
https://doi.org/10.1002/anie.199526131
https://onlinelibrary.wiley.com/doi/abs/10.1002/anie.199526131
https://onlinelibrary.wiley.com/doi/abs/10.1002/anie.199526131
https://doi.org/10.1038/s42004-023-00911-8
https://doi.org/10.1038/s42004-023-00911-8
https://www.nature.com/articles/s42004-023-00911-8
https://www.nature.com/articles/s42004-023-00911-8
https://doi.org/10.1021/acscentsci.7b00355
https://doi.org/10.1021/acscentsci.7b00355
https://doi.org/10.1021/acscentsci.7b00355
https://doi.org/10.1021/acscentsci.7b00355
https://doi.org/10.48550/arXiv.2001.01408
http://arxiv.org/abs/2001.01408
http://arxiv.org/abs/2001.01408
https://doi.org/10.1186/1758-2946-1-8
https://doi.org/10.1186/1758-2946-1-8
https://doi.org/10.1186/1758-2946-1-8
https://doi.org/10.1109/TCIAIG.2012.2186810
https://doi.org/10.1109/TCIAIG.2012.2186810
https://doi.org/10.1007/s10462-022-10228-y
https://doi.org/10.1007/s10462-022-10228-y
https://doi.org/10.1007/s10462-022-10228-y
https://doi.org/10.1145/2699485


22 De Novo Molecular Generation of Molecules with Consistent Synthetic Strategy

2699485. [Online]. Available: https://doi.org/
10.1145/2699485 (visited on 03/18/2024).

https://doi.org/10.1145/2699485
https://doi.org/10.1145/2699485
https://doi.org/10.1145/2699485
https://doi.org/10.1145/2699485


APPENDIX 1

Appendix A REINVENT configuration Options

This appendix aims to cover the hyper-parameters and scoring components, explaining their function
and use-case.

A.1 Randomizing SMILES

This is a configuration that allows REINVENT to apply SMILES randomization to its generated com-
pounds, wherein a SMILES is de-canonicalized and instead represented as a random valid SMILES
string. This can, in some cases, facilitate the learning process, as the same molecule can be rewarded,
corresponding to different sequences.

A.2 Diversity Filters

The diversity filters serve to introduce diversity by lowering the scores for repeatedly-seen molecules,
decided on by some metric. In other words, some measure is employed to measure similarity between
molecules, and too similar molecules are penalized. It has a few tunable parameters:

Table 3: Hyperparameters for Diversity Filter in REINVENT

Hyperparameter Description
Diversity Metric The metric used to calculate the similarity between

molecules. Tanimoto similarity on molecular finger-
prints is common.

Score Threshold The score after which generated molecules will be
put into memory, known as the ”bucket”.

”Bucket” Size The size of the memory (number of molecules) for
molecules that exceed the threshold.

Penalty Multiplier A scalar weight w ∈ [0, 1] that is multiplied with the
score, determining the harshness of the penalty.

A.2.1 Penalize Same Smiles

This diversity filter penalizes generation of identical canonical SMILES, which means that multiples of
the same molecules will be scored low. This enforces uniqueness.

A.2.2 Identical Murcko Scaffold

This filter penalizes the generation of molecules that have an identical Murcko scaffold. A Murcko
scaffold is the core structure of a molecule, such as chains and ring structures, but not counting the
substructures and substituents. This filter is thus much more strict in promoting diverse chemistry.

A.3 Scoring Transforms

This section aims to elaborate on the concept of scoring transforms. Given a score S(A) for a sequence
A, consider the transform T : R → [0, 1]. The function T aims to transform the score, which is not
necessarily bounded, to the bounded domain [0, 1]. Since various scores are measured on different scales,
it is necessary to normalize them to such a range. In REINVENT, this is commonly done with a sigmoid
or step function. Some examples will follow.


2 De Novo Molecular Generation of Molecules with Consistent Synthetic Strategy

A.3.1 Parameterized Sigmoid Scoring Transform

The sigmoid scoring transform is a common method for normalizing scores. By parameterizing, it can
be made more expressive. The parameterized sigmoid function can be defined as:

T (x) = 1
1 + ek(x−M) ,

M = L + H

2
.

where x is the input score, k is a steepness factor, L is the lower threshold and H is the upper threshold.
This function maps any real number x to the interval [0, 1]. The sigmoid function is smooth and has an
S-shaped curve, with a midpoint at M . It can be either a left-sigmoid or a right-sigmoid based on the
sign of the exponent. The combination of both creates a double-sigmoid, where a high scoring ”window”
is formed between them.

A.3.2 Truncated Exponential Scoring Transform

Another scoring transform is a truncated exponential function, on the form

T (x) = min(1, exp(−x)).

Which provides different characteristics to the sigmoid function. When considering the Route Distance
Score, which has its best possible value at 0, this scoring has a good fit. Most notably, the maximum
return coincides with the best value of the route distance score, and the slope is concave up, generating
increasing rewards as the score approaches 0.

A.3.3 Lenient Scoring Transform

The lenient scoring transform is a special configuration of the parameterized sigmoid scoring function,
where the steepness factor is set to a very low value. This creates a very slight and gradual slope.

A.3.4 Truncated Inverse Scoring Transform

The truncated inverse scoring transform is written as

T (x) = min(1,
1

1 + x
), x ≥ 0.

Similarly to the exponential scoring function, it has a maximum at x = 0, and a

Appendix B Scoring Components in MPO

The MPO runs performed in this thesis consisted of multiple scoring components, which are explained
here.

B.1 Molecular Weight

The molecular weight scoring component scores the generated molecules based on their molecular weight.
In the presented MPO runs, this was done with a double sigmoid with lower threshold 200 g/mol and
upper threshold 550 g/mol.


APPENDIX 3

B.2 ROCS Similarity

This scoring component evaluates a molecule based on it’s Rapid Overlay of Chemical Structure (ROCS)
Similiarity to a provided Template. It is a measure of structural similarity between a generated molecule
and a provided template structure, returning a higher score for higher similarity.

B.3 Custom Alerts

The custom alers flags specific substructures in a generated molecule by SMILES Arbitrary Target
Specification (SMARTS). For the MPO run it was selected to remove unstable groups and un-physical
ring sizes.

B.4 Quantitative Estimate of Drug-Likeness (QED)

The QED component returns a numerical estimate of drug-likeness for a molecule in the range [0, 1]. A
higher value means that it has a higher estimate of being a drug-like compound.

B.5 Hydrogen Bond Donors

The Hydrogen Bond Donors component measures a molecules fitness to ”Lipinskis Rule of Five”, which
states that a drug compound in general has no more than 5 hydrogen bond donors in its structure. This
is done by measuring the number of Hydrogen Bond Donors in the molecule and taking a right-sigmoid
with upper threshold 6 and lower threshold 2.


	Introduction
	REINVENT
	AiZynthFinder
	Aim

	Theory
	Chemistry
	Representation of Molecular Structure
	Chemical Synthesis
	Retrosynthesis
	Classification of Molecular Reactions
	Tree Edit Distance

	REINVENT
	Recurrent Neural Networks
	Gated Recurrent Unit (GRU)
	RNN training on Next-Token Prediction
	Sequence Generation with RNNs
	Reinforcement Learning
	Policy-Gradient Methods
	REINFORCE
	Fine Tuning Molecular Generators
	Scoring System
	Multiple Parameter Optimization

	AiZynthFinder
	Computer Aided Synthesis Planning
	Monte Carlo Tree Search (MCTS)
	Neural Network guided MCTS


	Methods
	Molecular Generation conditioned on Synthetic Similarity
	Connecting AiZynthFinder and REINVENT
	Route Similarity in REINVENT
	Scoring based on Reference Route
	Emergent Route Distance Score

	Parameter Evaluation
	Multiple Parameter Optimization - Realistic Application

	Results
	Reference Route Scoring Component
	Parameter Search
	MPO

	Emergent Route Scoring

	Discussion
	Reference Route Scoring
	MPO - Real Application

	Emergent Route Scoring
	MPO for the emergent scoring

	Further research

	Conclusion
	Appendix REINVENT configuration Options
	Randomizing SMILES
	Diversity Filters
	Penalize Same Smiles
	Identical Murcko Scaffold

	Scoring Transforms
	Parameterized Sigmoid Scoring Transform
	Truncated Exponential Scoring Transform
	Lenient Scoring Transform
	Truncated Inverse Scoring Transform


	Appendix Scoring Components in MPO
	Molecular Weight
	ROCS Similarity
	Custom Alerts
	Quantitative Estimate of Drug-Likeness (QED)
	Hydrogen Bond Donors