Molecular Optimization using Deep
Learning

Extensions of the Transformer for Molecular Optimization

Master’s thesis in Computer science and engineering

Marcus Forsberg
Felix Mattsson

Department of Computer Science and Engineering
CHALMERS UNIVERSITY OF TECHNOLOGY
UNIVERSITY OF GOTHENBURG
Gothenburg, Sweden 2021


© Marcus Forsberg, Felix Mattsson, 2021.

Supervisor: Yinan Yu, Department of Computer Science and Engineering
Advisor: Jiazhen He, AstraZeneca
Examiner: Alexander Schliep, Department of Data Science and AI

Master’s Thesis 2021
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Cover: Molecular Optimization of left molecule (CHEMBL3953242), resulting in the
generated molecules displayed to the right (CHEMBL3896788, CHEMBL3936463)

Typeset in LATEX
Gothenburg, Sweden 2021

Abstract
In recent years, developments in deep learning have provided new approaches
to molecular optimization. Molecular optimization aims to find structurally similar
molecules to a given starting molecule, yielding specified improvements in terms
of different molecular properties. By representing molecules as SMILES, an ap-
proach to encode molecules as strings of tokens, molecular optimization can be
framed as a machine translation problem, where starting molecules are translated
to molecules with optimized properties. Previous work has shown success for the
Transformer known from natural language processing [1, 2] in the area of molecular
optimization. The thesis covers two extensions of the developed Transformer model
in [1] through curriculum learning and Core-Fixed formulation. Through curriculum
learning, training is structured through a sequence of tasks (curriculum) based on
increasing difficulty. The curriculum could either be determined while training a
model (machine-based) or manually (human heuristic-based). The thesis explores
various approaches to human-based curriculum learning. For the other extension,
Core-Fixed formulation, the thesis provides an approach to reformulating the input
and output of the original model [1], which involves specifying in the input to the
translation model which part should be kept fixed (core) and which part should
be exchanged (R-group) to optimize the complete molecule’s properties. The results
show advantages both in training time and molecule generation performance using
the Core-Fixed formulation. For curriculum learning, the results do not indicate a
clear improvement. The thesis suggests looking into more sophisticated curriculum
learning approaches.

Keywords: Molecular Optimization, Matched Molecular Pairs, Transformer, AD-
MET, Master’s Thesis.



Acknowledgements
First and foremost, we would like to thank our supervisor at AstraZeneca, Jiazhen
He, who both created the work on which this entire thesis is based, and who has
helped us throughout the project with useful insights and feedback. Without her
work this thesis would not be possible. Additionally, we thank the entire Molecular
AI team of AstraZeneca for their help with setting up the computer environment
and preparing the code on which our implementations are based. In particular, we
would like to mention Vendy Fialkova, Graduate Scientist at AstraZeneca, for her
feedback on the report and Ola Engkvist, the director of the Molecular AI team, for
his continuous inputs that have helped to shape the project.

We would also like to thank our academic supervisor at Chalmers, Yinan Yu, and
our opponent Felix Nordén, for their constructive and useful feedback on the report.
Finally, we thank our families and friends for their continuous support.

Marcus Forsberg, Felix Mattsson, Gothenburg, February 2021



Abbreviations
The following list shows frequent abbreviations used in this report.

• SMILES - Simplified Molecular Input Line Entry System
• MMP - Matched Molecular Pair (molecules that differ only by a single transformation)
• NLP - Natural Language Processing
• AZ - AstraZeneca
• ADMET - Absorption, Distribution, Metabolism, Excretion and Toxicity
• CL - Curriculum Learning
• CF - Core-Fixed


Contents

List of Figures

List of Tables

1 Introduction
    1.1 Drug Discovery
    1.2 Challenges in Drug Discovery
    1.3 String Based Molecular Representation
    1.4 Project Description: Extensions of Previous Model
        1.4.1 Curriculum Learning
        1.4.2 Core-Fixed Molecular Optimization
    1.5 Thesis Outline

2 Theory
    2.1 Transformer
        2.1.1 Token Embedding
        2.1.2 Positional Encoding
        2.1.3 Attention
        2.1.4 Encoder
        2.1.5 Decoder
        2.1.6 Sequence Generation
    2.2 Transformer for Molecular Optimization
        2.2.1 Considered Molecular Properties
    2.3 Model Description
    2.4 Curriculum Learning
        2.4.1 Difficulty Assessment
        2.4.2 Curriculum Arrangement

3 Methods
    3.1 Platform
    3.2 Data Preparation
        3.2.1 Preparation of Matched Molecular Pairs
        3.2.2 Property Representation
    3.3 Curriculum Learning
        3.3.1 Simulation of Missing Property Values
        3.3.2 Difficulty Assessment
        3.3.3 Curriculum Arrangement
        3.3.4 Baseline
        3.3.5 Training and Validation
        3.3.6 Test Sets
        3.3.7 Evaluation Metrics
    3.4 Core-Fixed Molecular Optimization
        3.4.1 Input and Output Representations
        3.4.2 Baselines
        3.4.3 Training and Validation
        3.4.4 Test Sets
        3.4.5 Evaluation Metrics

4 Results
    4.1 Curriculum Learning
        4.1.1 Comparison of Exploratory Difficulty Assessments
            4.1.1.1 Training and Validation
            4.1.1.2 Sample-Based Molecule Generation
        4.1.2 Comparison of Property-Based with Comparative Difficulty Assessments
            4.1.2.1 Training and validation
            4.1.2.2 Sample-Based Molecule Generation
        4.1.3 Computational Time
    4.2 Core-Fixed Molecular Optimization
        4.2.1 Training and Validation
        4.2.2 Deterministic Molecule Generation
        4.2.3 Sample-Based Molecule Generation
        4.2.4 Novel R-Group Samples
        4.2.5 Example of Baseline’s Inability to Keep the Core
        4.2.6 Computational Time

5 Discussion
    5.1 Curriculum learning
        5.1.1 Training and Validation
        5.1.2 Molecule Generation
        5.1.3 Future Work
    5.2 Core-Fixed Molecular Optimization
        5.2.1 Training and Validation
        5.2.2 Molecule Generation
        5.2.3 Future Work

6 Conclusion

Bibliography

A Additional Results for Curriculum Learning
    A.1 Choice of Epoch Numbers for Difficulty Assessments
    A.2 Additional Comparative Results

B Additional Results for Core-Fixed Transformer
    B.1 Required Number of Samples to Generate Molecules
    B.2 Hyperparameter Tuning

C Testing for Significant Difference in Distributions


List of Figures

1.1 Comparison of machine translation and molecular optimization
1.2 The structural formula and SMILES-string for the molecule caffeine.
1.3 An example of matched molecular pair and corresponding properties.

2.1 The Transformer architecture introduced in [2]
2.2 An example of input and output of the previous model [1].
2.3 Visualization of difficulty based buckets in curriculum learning

3.1 All encoded property change tokens
3.2 Visualization showing how the data was masked to simulate practical scenario of missing properties.
3.3 Training data division for Property-, Length- and Token Rarity based difficulties with 4 buckets
3.4 Input and output chain of the Core-Fixed model

4.1 Training and validation loss over training time for the Exploratory difficulty assessments in curriculum learning
4.2 Validation accuracy over training time and smoothed validation accuracy over epochs for the Exploratory difficulty assessments in curriculum learning
4.3 Training and validation loss over training time for the Comparative difficulty assessments in curriculum learning
4.4 Validation accuracy over training time and smoothed validation accuracy over epochs for the Comparative difficulty assessments in curriculum learning
4.5 Training hours required for each of the Exploratory and Comparative models, including the baseline.
4.6 Train and validation loss over epoch, and validation accuracy over epoch, for the Core-Fixed model
4.7 Number of generated molecules with desirable properties per source molecule when using the Core-Fixed model and its associated Baseline
4.8 Top 10 most frequent novel R-groups found by the Core-Fixed model
4.9 Example of a source molecule for which the baseline failed to keep the core
4.10 Generations based on the source molecule in Figure 4.9 for which the Baseline failed to keep the core

A.1 Validation loss and validation accuracy for the three difficulty assessments
A.2 Comparative validation loss and validation accuracy for the length and token based difficulty assessments

B.1 Distribution of number of samples required to generate 10 unique and valid molecules for the Core-Fixed model and its Baseline
B.2 Training loss, validation loss and validation accuracy over training epoch for the 8 hyperparameter combinations yielding the best performing Core-Fixed Transformer


List of Tables

2.1 Hyperparameters for the Transformer used in [1]

3.1 The amount of molecules and molecular pairs still having its properties after the masking

4.1 Comparison of the generation performance of the Explorative difficulty assessments (DA) and the corresponding baseline
4.2 Comparison of the generation performance of the Comparative difficulty assessments and the corresponding baseline
4.3 Comparison of the Core-Fixed model and its Baseline, when using greedy decode with one generated molecule per starting molecule
4.4 Comparison of the performance of the Core-Fixed model and its corresponding baselines
4.5 Comparison of run-times for Core-Fixed model and corresponding baselines

B.1 The 8 hyperparameters that yielded the best performing Core-Fixed Transformer models


1 Introduction

In this chapter we give a background of molecular optimization and the work that
this thesis is based on. Furthermore, we briefly introduce the concept of representing
molecules as SMILES strings in Section 1.3, and the main tasks that we consider in
this thesis in Section 1.4.

1.1 Drug Discovery
The use of drugs as medication by humans has been around for thousands of years,
with one early example being the use of herbal medicines, dating back to the Stone
Age [3]. Ever since then, more and more drugs have been discovered, and as our
society has grown and made significant advancements in science and technology, so
have our methods for discovering new drugs. Historically, drugs have been discov-
ered by identifying the active substances in already known remedies. More recently,
the methods have been expanded by large-scale companies, through usage of chemical
databases, clinical trials, advancements in biochemistry and much more [4].

When designing a drug, one typically aims for a good balance of different molecular
properties. Common considerations are a molecule’s physicochemical properties and its
absorption, distribution, metabolism, elimination and toxicity (ADMET) properties [1].
Typically, a promising molecule needs to be improved to gain a balance
between multiple properties – a problem known as molecular optimization. Tra-
ditionally, molecular optimization has relied on chemists utilizing their knowledge
and intuition to apply chemical transformations to a promising molecule. Here the
matched molecular pair (MMP) analysis [5, 6], which compares the properties of
two molecules that differ only by a single chemical transformation, has been used
extensively [7, 8, 9]. The chemist’s approach is based on the assumption that similar
molecules possess similar properties, which has been useful to make approximations
of molecules’ properties through history.

1.2 Challenges in Drug Discovery

Considering that the drug-like space is estimated to contain $10^{23}$ to $10^{60}$ molecules [10],
an exhaustive search for potential drugs is not feasible. Moreover, only $10^{8}$ substances
have ever been synthesized [11]. At AstraZeneca (AZ), developing suitable
models to find potential drugs is of high relevance since it can help accelerate the
drug-development process and in the end save lives.

The recent development in deep learning has enabled further potential for find-
ing molecules with desirable properties. Previous deep learning approaches have,
however, often ignored the domain knowledge of chemical transformations. At As-
traZeneca, a recent deep learning model has been developed, and is introduced in [1].
It is based on the chemical transformations (i.e. MMPs), which reflect the chemist’s
intuition. The molecular optimization problem is framed as a machine translation
problem in natural language processing (NLP), where a promising molecule, repre-
sented as a string, is translated into a similar molecule with optimized properties,
much like how a sentence in English would be translated into a sentence in French,
see Figure 1.1. More specifically, the work at AstraZeneca [1] showed potential for
a model based on the Transformer, which is a neural network architecture used in
NLP [2].

[Figure 1.1 panels: a Language Model translates the English sentence ”Molecules rule the world” into the French sentence ”Les molécules dominent le monde”; the Molecular Optimization Model translates the Promising Molecule (CHEMBL3685029) into the Target Molecule (CHEMBL3685059).]

Figure 1.1: Comparison of machine translation and molecular optimization. When
a promising molecule is represented as a string, it can be translated to a similar
molecule with optimized properties using the Molecular Optimization model.

1.3 String Based Molecular Representation
When working with molecules, a common way to describe them is through their
molecular formula. The molecular formula is a string representation of the molecule
which contains information about what atoms the molecule consists of. As an exam-
ple we have H2O, which is the molecular formula for water, and it states that water
consists of two hydrogen atoms and one oxygen atom. This is an easy representation
of a molecule which tells us a lot, but one thing that it does not tell us is how these
atoms are connected to each other, in other words, it does not give us any structural
information.

To include the structural information, while keeping it as a string, which is suitable
if we want a program to read it, one can use the representation Simplified Molecular
Input Line Entry System, or in short, SMILES [12]. With SMILES, any molecule
can be represented as a string of ASCII-symbols. It is similar to the molecular
formula, but extra symbols are added between the atoms to show how they are
connected. One example of a SMILES-string is shown in Figure 1.2, together with
its corresponding structural formula. In the standard version of the SMILES speci-
fication, any molecule can be represented as a string, but the representation is not
unique, meaning that multiple strings can correspond to the same molecule. This
can easily be corrected for if needed, making each molecule have one and only one
string-representation. That representation is referred to as the canonical SMILES
for the molecule [12].

CN1C=NC2=C1C(=O)N(C(=O)N2C)C

Figure 1.2: The structural formula and SMILES-string for the molecule caffeine.
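
Since a translation model operates on token sequences, a SMILES string has to be split into tokens before it can serve as model input. The following is a minimal sketch of such a split, not the tokenizer used in [1]. The regular expression and the name `tokenize_smiles` are assumptions made for illustration: bracket atoms, the two-letter halogens Cl and Br and two-digit ring closures are kept as single tokens, and every other character becomes its own token.

```python
import re
from typing import List

# Assumed token pattern (illustrative only): bracket atoms such as [nH],
# the halogens Br and Cl, two-digit ring closures such as %12,
# and finally any remaining single character.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize_smiles(smiles: str) -> List[str]:
    """Split a SMILES string into a list of tokens."""
    return SMILES_TOKEN.findall(smiles)


caffeine = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"
tokens = tokenize_smiles(caffeine)
print(tokens)
```

Joining the tokens again reproduces the original string, so the split loses no information.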

1.4 Project Description: Extensions of Previous
Model

In this thesis the aim is to extend and improve the Transformer-based model introduced
in [1] in two respects. The first is to apply curriculum learning [13] to the model.
This is a way of training a model by “starting small”: the model first trains only on
an easy subset of the data, and more difficult data is incorporated later. The hope is
that this makes training faster and the final accuracy better. The second extension,
the Core-Fixed formulation, optimizes only a specific part of the molecule and leaves
the other part unchanged under the transformation.

1.4.1 Curriculum Learning
Humans and animals tend to learn better when the content to be learned is pre-
sented in a meaningful order, a curriculum, usually starting with easy and basic
concepts, to later move on to more advanced ones. Motivated by this, research has
been performed to see if something similar also holds for the learning of machines.
The idea was popularized by Bengio in 2009 [13], but was investigated already in
1993 [14]. After this, studies have been done where curriculum learning has been
successfully applied in computer vision [15], and more recent studies have also suc-
cessfully applied it to language models [16, 17, 18, 19].

In these studies, many advantages of utilizing the concept of a curriculum are men-
tioned, with the most frequent being that it can increase learning speed and improve
final performance on test data by finding a better local optimum from the same
training data. Other advantages mentioned are that it can help in generalization
and that it gives universal performance improvements on a wide range of natural
language understanding tasks.

1.4.2 Core-Fixed Molecular Optimization
The original model produced in [1] was trained to translate the entire SMILES
of the source molecule into the entire SMILES of the generated target molecule.
A disadvantage of this approach is that there is no guarantee that the generated
target molecule will be a matched molecular pair with the source molecule, i.e.
that the source and generated molecules differ only by a single transformation.
With the chemist’s approach, one could be interested in keeping one part of the
molecule (core), while exchanging the other (R-group) (Figure 1.3), in order to
obtain a molecule with optimized properties. Motivated by this we will train a
model, identical to the original model, except that it will generate only the target
R-group instead of the whole molecule, given a source core and R-group. Since the
SMILES sequence of the target R-group is shorter than the SMILES sequence of the
entire target molecule, such a model might yield shorter training time and improved
performance over the original model.

[Figure 1.3 panel labels: Core, R-Group, Transformation; Source Molecule (CHEMBL3685029), Target Molecule (CHEMBL3685059); Properties: LogD: 4.14, Solubility: -0.01, Clearance: 1.68 and LogD: 2.75, Solubility: 0.72, Clearance: 1.21.]

Figure 1.3: An example of matched molecular pair and corresponding properties.

1.5 Thesis Outline
In this introductory chapter, we have given a brief context of this thesis. In Chapter
2, we introduce the Transformer model and give a summary of the previous work
which this thesis is built on. We also give a conceptual background to curriculum
learning, which we use in one of the extensions we are considering. In Chapter 3,
we present the method for accomplishing our two considered extensions curriculum
learning and Core-Fixed formulation. This includes preparation of the data and how
each model is trained and evaluated. In Chapter 4 we review the results, which the
discussion in Chapter 5 is built on. Finally, Chapter 6 gives the main conclusions
of the work.


2 Theory

In this chapter we describe the Transformer architecture, and give a summary of
the specific Transformer model that our two extensions are based on. We also give
a conceptual background to curriculum learning.

2.1 Transformer

The Transformer architecture was introduced by [2] and was shown to yield state-of-
the-art performance within NLP. Through comparison of the BLEU-score, a metric
describing the quality of translating a sequence between languages [20], the article
showed that the Transformer outperformed its alternatives when translating between
English and German. Besides the typical NLP tasks, the Transformer has
been shown to perform well in other sequence tasks, such as time series forecasting [21].

The input and output of the Transformer consist of token sequences. In the area
of NLP the input could represent a sentence in one language that is fed into the
Transformer which then generates the sentence’s translation in a different language.
The Transformer architecture consists of an encoder and a decoder stack, visualized
in Figure 2.1. The encoder stack could be thought of as a mapping from a sequence
of tokens in the token space, to a sequence of continuous representations. With
the sequence of continuous representations from the encoder, the decoder generates
one token at a time based on the previously generated tokens. Note that both the
encoder and decoder stacks take each token’s sequence position into account using
a so called positional encoding.

2.1.1 Token Embedding

To numerically deal with the typically text based tokens, one early popular approach
is one-hot-encoding. One-hot-encoding consists of letting each possible token be
represented by an $N_{\text{Token}}$-dimensional vector of $N_{\text{Token}} - 1$ ‘0’s and a single ‘1’, where
$N_{\text{Token}}$ represents the size of the token space. By construction the position for which
the ‘1’ occurs will be unique for each token, see (2.1).

Figure 2.1: The Transformer architecture introduced in [2]. The left part repre-
sents the encoder stack and the right part the decoder stack.

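\begin{align}
\text{Token } 1 &: \overbrace{[\,1 \;\; 0 \;\; \cdots \;\; 0\,]}^{N_{\text{Token}}} \tag{2.1} \\
\text{Token } 2 &: [\,0 \;\; 1 \;\; \cdots \;\; 0\,] \notag \\
&\;\;\vdots \notag \\
\text{Token } N_{\text{Token}} &: [\,0 \;\; 0 \;\; \cdots \;\; 1\,] \notag
\end{align}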

The computational complexity is reduced by employing an Input Embedding which
limits the dimensionality of the input data presented to the Transformer. Here the
Input Embedding is trained together with the rest of the model. Using this em-
bedding, it is further possible for the Transformer to exploit structural similarity
between tokens, as can be seen in language applications where representations of
similar words commonly lie close to each other in the embedded space. The Input
Embedding converts each token from the token space into a continuous
$d_{\text{model}}$-dimensional vector.
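
The relation between one-hot encoding and the Input Embedding can be made concrete with a small sketch. It assumes that the embedding is a matrix with one row per token, which is a common way to realize such a layer; the toy vocabulary, the dimension 4 and all function names are hypothetical. Looking up a row of the matrix gives the same result as multiplying the one-hot vector with the matrix, but avoids handling the long sparse vector.

```python
import random

vocab = ["C", "N", "O", "=", "(", ")"]  # toy token space (assumption)
n_token = len(vocab)
d_model = 4                              # toy embedding dimension (assumption)


def one_hot(token):
    """Return the one-hot vector of a token, of length n_token."""
    vec = [0] * n_token
    vec[vocab.index(token)] = 1
    return vec


# Embedding matrix of shape n_token x d_model. The values are random here;
# in a real model they are learned together with the other parameters.
rng = random.Random(0)
embedding = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(n_token)]


def embed(token):
    """Embedding lookup: pick the row that belongs to the token."""
    return embedding[vocab.index(token)]


def embed_via_one_hot(token):
    """The same vector, computed as the product one_hot(token) * embedding."""
    oh = one_hot(token)
    return [sum(oh[r] * embedding[r][c] for r in range(n_token))
            for c in range(d_model)]


print(one_hot("N"))
print(embed("N"))
```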

2.1.2 Positional Encoding
For the Transformer to understand the order of the tokens in the inputs, positional
encoding is necessary [2]. This is handled by, for the token at position $pos$, adding

\[
PE(pos, i) =
\begin{cases}
\sin\left(\dfrac{pos}{10000^{i/d_{\text{model}}}}\right), & \text{for } i \text{ even} \\[2ex]
\cos\left(\dfrac{pos}{10000^{(i-1)/d_{\text{model}}}}\right), & \text{for } i \text{ odd}
\end{cases}
\tag{2.2}
\]

elementwise for $i \in [1, d_{\text{model}}]$ to the continuous representation vector yielded by the
input embedding.
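
As an illustration, the positional encoding can be computed directly from (2.2). This is a minimal sketch that follows the formula as stated above, with the component index i running from 1 to the model dimension. The function names and the choice to number positions from 0 are assumptions.

```python
import math


def positional_encoding(pos, d_model):
    """Return [PE(pos, 1), ..., PE(pos, d_model)] following (2.2)."""
    pe = []
    for i in range(1, d_model + 1):
        if i % 2 == 0:
            pe.append(math.sin(pos / 10000 ** (i / d_model)))
        else:
            pe.append(math.cos(pos / 10000 ** ((i - 1) / d_model)))
    return pe


def add_positional_encoding(embedded):
    """Add the encoding elementwise to every embedded token vector.

    `embedded` is a list of token vectors; the first token is assumed
    to have position 0.
    """
    d_model = len(embedded[0])
    return [
        [x + p for x, p in zip(vec, positional_encoding(pos, d_model))]
        for pos, vec in enumerate(embedded)
    ]


print(positional_encoding(0, 4))
print(positional_encoding(1, 4))
```

Two tokens with identical embeddings thus obtain different representations as soon as they occur at different positions.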

2.1.3 Attention
As seen in Figure 2.1, both the encoder and decoder stacks of the Transformer rely
on Multi-Head Attention layers. A Multi-Head Attention layer consists of multiple
so-called Scaled Dot-Product Attention heads.

Scaled Dot-Product Attention. A Scaled Dot-Product Attention head uses three
matrices $W^Q$, $W^K$ and $W^V$ of dimensions $d_{\text{model}} \times d_k$, which are learned during
model training. Here the hyperparameter $d_{\text{model}}$ comes from the dimension of the
Input Embedding, and $d_k$ is a hyperparameter for the attention layer. With $W^Q$,
$W^K$ and $W^V$, the query matrix $Q$, key matrix $K$ and value matrix $V$ are calculated
using

\[
Q = XW^Q, \qquad K = XW^K, \qquad V = XW^V,
\]

where row $i$ in $X$ equals the (current) continuous representation of token $i$. Letting
$n$ correspond to the number of tokens, $X$ would thus be of dimension $n \times d_{\text{model}}$.
Using $Q$, $K$ and $V$ the Scaled Dot-Product Attention is calculated,

\[
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V, \tag{2.3}
\]

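where softmax is the activation function that is applied to each row $z$ of its matrix
argument and defined, for a vector index $j$, through

\[
\text{softmax}_j(z) = \frac{e^{z_j}}{\sum_{k} e^{z_k}},
\]

with the sum running over all components $k$ of $z$.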

The usage of the scaling factor $1/\sqrt{d_k}$ in (2.3) is intended to counteract the risk of
obtaining small gradients during back-propagation [2].
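
The computation of a single attention head can be sketched in a few lines. The code below is an illustration that uses plain Python lists as matrices, not an excerpt from the model in [1]; the helper names `matmul`, `softmax`, `attention_weights` and `attention` are assumptions.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(z):
    """Softmax of a vector; the maximum is subtracted for numerical stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(q, k):
    """Row-wise softmax of Q K^T / sqrt(d_k)."""
    d_k = len(q[0])
    k_t = [list(col) for col in zip(*k)]  # transpose of K
    scores = matmul(q, k_t)
    return [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]


def attention(x, w_q, w_k, w_v):
    """One Scaled Dot-Product Attention head: softmax(Q K^T / sqrt(d_k)) V."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    return matmul(attention_weights(q, k), v)


# Toy input: n = 3 tokens with d_model = 2 and d_k = 2.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
print(attention(x, identity, identity, identity))
```

Each row of the attention weights sums to one, so every output row is a weighted average of the rows of V.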

Multi-Head Attention. With the usage of multiple Scaled Dot-Product Attention
heads we obtain Multi-Head Attention. Letting $Z_{\text{HEAD}_i}$ represent the output from
Scaled Dot-Product Attention head $i$, the Multi-Head Attention is obtained by

\[
\text{Multi-Head Attention} = [Z_{\text{HEAD}_1} \;\; Z_{\text{HEAD}_2} \;\; \cdots \;\; Z_{\text{HEAD}_h}]\,W^O,
\]

where $h$ corresponds to the number of Scaled Dot-Product Attention heads and $W^O$
is a $hd_k \times d_{\text{model}}$ matrix that is learned during training. Note that each head $i$ is
associated with unique $W_i^Q$, $W_i^K$ and $W_i^V$ matrices, and that the hyperparameter
$d_k$, describing the dimensionality of these, is shared among the heads. With $W^O$,
the Transformer model can learn how to weigh the outputs from the different Scaled
Dot-Product Attention heads, where each head can be thought to capture different
qualities of the token sequence.
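
The combination step can be sketched as follows, assuming the head outputs have already been computed, each as an n x d_k matrix. The function names and the toy numbers are hypothetical; the sketch only shows the column-wise concatenation followed by the projection with the output matrix.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def multi_head_attention(head_outputs, w_o):
    """Concatenate h head outputs (each n x d_k) column-wise and project.

    The concatenated matrix has shape n x (h * d_k), so w_o must have
    h * d_k rows; its number of columns is d_model.
    """
    n = len(head_outputs[0])
    concatenated = [
        [value for head in head_outputs for value in head[token]]
        for token in range(n)
    ]
    return matmul(concatenated, w_o)


# Toy example: h = 2 heads, n = 2 tokens, d_k = 1, d_model = 2.
head_1 = [[1.0], [2.0]]
head_2 = [[3.0], [4.0]]
w_o = [[1.0, 1.0], [1.0, 1.0]]
print(multi_head_attention([head_1, head_2], w_o))
```

With this all-ones output matrix every output component is the sum of the two head outputs for the token; a learned matrix instead assigns its own weights to each head.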

2.1.4 Encoder
The encoder stack consists of multiple stacked encoder layers, where the input to the
first layer is the continuous representation after the positional encoding mechanism.
Each encoder layer itself consists of two sub-layers, firstly Multi-Head Attention -
described in the previous section - and secondly a positionwise fully connected feed-
forward network. The feed-forward network in a specific encoder layer is represented
by

\[
\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2, \tag{2.4}
\]

which we recognize as two linear transformations with a ReLU activation in between.
Note that the max function works elementwise on the vector $xW_1 + b_1$. To match
the input and output dimensions, the $W_1$ matrix is of dimension $d_{\text{model}} \times d_{ff}$, the
vector $b_1$ of dimension $d_{ff}$, while the matrix $W_2$ is of dimension $d_{ff} \times d_{\text{model}}$ and the vector
$b_2$ is of dimension $d_{\text{model}}$. Here $d_{ff}$ represents an additional tuning parameter.
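
For a single token vector, the feed-forward network in (2.4) can be written out as below. This is a sketch with hypothetical toy weights, meant to make the dimensions explicit, with a model dimension of 2 and a hidden dimension of 3.

```python
def feed_forward(x, w1, b1, w2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2 for one token vector x."""
    # First linear transformation followed by the elementwise ReLU.
    hidden = [
        max(0.0, sum(x[i] * w1[i][j] for i in range(len(x))) + b1[j])
        for j in range(len(b1))
    ]
    # Second linear transformation back to the model dimension.
    return [
        sum(hidden[j] * w2[j][k] for j in range(len(hidden))) + b2[k]
        for k in range(len(b2))
    ]


x = [1.0, -1.0]                             # one token, model dimension 2
w1 = [[1.0, 0.0, -1.0], [0.0, 1.0, 1.0]]    # 2 x 3
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # 3 x 2
b2 = [0.5, -0.5]
print(feed_forward(x, w1, b1, w2, b2))
```

Here the first transformation gives [1, -1, -2], the ReLU sets the negative entries to zero, and the second transformation maps the result back to two components. The network is positionwise, meaning that the same weights are applied to every token vector separately.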

Both sub-layers are used within a residual connection, which in token $i$ yields the
output $y_i$ from the input $x_i$ according to

\[
(y_i)_j = \frac{(x_i + \text{Sublayer}(x_i))_j - \mu_j}{\sigma_j},
\]

where $j$ is the component and Sublayer represents the function of a sub-layer (i.e.
FFN or Multi-Head Attention), and where

\[
\mu_j = \frac{1}{n_{\text{token}}} \sum_{i=1}^{n_{\text{token}}} (x_i + \text{Sublayer}(x_i))_j,
\]

\[
\sigma_j = \sqrt{\frac{1}{n_{\text{token}}} \sum_{i=1}^{n_{\text{token}}} \left( (x_i + \text{Sublayer}(x_i))_j - \mu_j \right)^2}.
\]

Here $n_{\text{token}}$ represents the number of tokens in our input sequence. This technique is
known as layer normalization and was introduced for sequence based neural networks
in [22] to reduce training time. This computation is represented by Add & Norm in
Figure 2.1.

2.1.5 Decoder
Like the Encoder, the Decoder stack of the Transformer network consists of a stack
of N decoder layers. Each decoder layer employs similar sub-layers as the ones of the
encoder. Apart from the Multi-Head Attention sub-layer and the feed-forward network,
the decoder also uses an additional Multi-Head Attention sub-layer, denoted
by Masked Multi-Head Attention in Figure 2.1. The Masked Multi-Head Attention
works on the token sequence that has been generated so far. To prevent this Multi-Head
Attention from using output tokens that have not been generated yet, the attention
scores corresponding to the (as of yet) not generated tokens are set to $-\infty$ before the
softmax in (2.3), so that these tokens receive zero attention weight [2]. Note that every
sequence that is generated by the decoder starts with a common start token, which is
treated by the decoder as if it was the first generated token.
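
One common way to realize this masking, following [2], is to replace the attention scores of all later positions by negative infinity before the softmax is applied, which gives these positions a weight of exactly zero. The sketch below illustrates this for a small score matrix; the function names are assumptions.

```python
import math


def softmax(z):
    """Softmax of a vector; entries equal to -inf obtain probability 0."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def masked_attention_weights(scores):
    """Row-wise softmax where position i may only attend to positions j <= i."""
    weights = []
    for i, row in enumerate(scores):
        masked = [s if j <= i else -math.inf for j, s in enumerate(row)]
        weights.append(softmax(masked))
    return weights


# Toy scores for 3 positions; all equal, so only the mask shapes the result.
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
for row in masked_attention_weights(scores):
    print(row)
```

The first position can only attend to itself, the second splits its attention over the first two positions, and only the last position sees the whole sequence.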

Following the Masked Multi-Head Attention sub-layer is another Multi-Head Atten-
tion sub-layer which takes both the output from the previous decoder layer, as well
as the output of the encoder, as input. Like with the encoder, the final layer in the
decoder is a feed-forward network as defined in (2.4). For each decoder sub-layer
the Transformer also employs the layer normalization, much like the encoder.

2.1.6 Sequence Generation
The output of the decoder stack is fed into a single linear layer which yields a vector
of dimension dmodel. That output vector is then fed into a softmax function which
converts the continuous vector to output probabilities in the output token space.
The output probabilities are used to generate tokens, typically with one of the
following techniques:

Greedy Decode - The token with highest probability is chosen. After a token has
been chosen the decoding procedure repeats to select the next token with the highest
probability given the previous ones. Note that this choice of decoding procedure is
deterministic, meaning that the model will always generate the same sequence for a
given input.

Multinomial Decode - A token is sampled based on interpreting the token prob-
abilities as a discrete distribution. As for greedy decode, the decoding procedure is
repeated to select the next token given the previous ones, after a token has been
chosen.

Beam Search - Using the beam width B, token probabilities are generated for each
of the B most likely token sequences from the previous step. The (model-estimated)
conditional probability of a token sequence is the product of the probabilities of
each of its tokens given the previous ones, i.e.

\[
P(t_0, \ldots, t_{N-1} \mid x) = P(t_0 \mid x) \prod_{i=1}^{N-1} P(t_i \mid x, t_0, \ldots, t_{i-1}),
\]

where x constitutes the input sequence, tj for j ∈ [0, N − 1] represents the token at
position j and N the length of the token sequence. When B sequences have been
generated, the token sequence with the highest conditional probability is chosen.
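
The three decoding techniques above can be sketched as follows. This is a minimal illustration and not the implementation used in [1]: the callable `next_probs`, which maps the tokens generated so far to a probability for each token in the vocabulary, stands in for the decoder followed by the softmax and is hypothetical, as are all function names. Log-probabilities are summed instead of multiplying probabilities, which gives the same ranking of sequences.

```python
import math
import random


def greedy_choice(probs):
    """Greedy decode: index of the most probable token."""
    return max(range(len(probs)), key=lambda k: probs[k])


def multinomial_choice(probs, rng):
    """Multinomial decode: sample a token index from the distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def sequence_log_prob(step_probs, tokens):
    """Log of the product of P(t_i | x, t_0, ..., t_{i-1}) over the sequence."""
    return sum(math.log(p[t]) for p, t in zip(step_probs, tokens))


def beam_search(next_probs, beam_width, max_len, end_token):
    """Keep the beam_width most likely prefixes at every step.

    next_probs(prefix) is a hypothetical stand-in for the model.
    Returns (tokens, log_probability) of the best sequence found.
    """
    beams = [((), 0.0)]
    for _ in range(max_len):
        candidates = []
        for tokens, logp in beams:
            if tokens and tokens[-1] == end_token:
                candidates.append((tokens, logp))  # finished, keep as is
                continue
            for tok, p in enumerate(next_probs(tokens)):
                if p > 0:
                    candidates.append((tokens + (tok,), logp + math.log(p)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(t and t[-1] == end_token for t, _ in beams):
            break
    return beams[0]
```

With a beam width of 1 the search reduces to greedy decode, while a larger beam width can find sequences whose first token is not the most probable one but whose total probability is higher.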

Note that for the model to start and stop generating tokens appropriately, common
start and end tokens are used. The start token is fed to the decoder as the first token
of the Outputs entering the Output Embedding, see Figure 2.1. The end token is used by
the model to determine when to stop generating new tokens.

2.2 Transformer for Molecular Optimization
The article [1] showed the potential of the Transformer for molecular optimization,
where a source molecule is transformed into a similar molecule, called the target
molecule, with new, desirable molecular properties that are specified as input together
with the source molecule. Which molecular properties were considered is explained in
the following section.

2.2.1 Considered Molecular Properties
As a proof of concept, the Transformer developed at AZ considers the three
ADMET properties LogD, Solubility and Clearance:

LogD - LogD, which is short for the logarithm of the distribution coefficient, is a
measure of the lipophilicity of the molecule, or in other words its potential to dissolve
in lipids, fats and other non-polar solvents. This is relevant in drug discovery since
a molecule with high lipophilicity is more likely to penetrate cell membranes [23].
Too high a LogD could, however, be toxic.

Solubility - Generally, Solubility is a measure of a molecule’s capacity to dissolve
in a certain solvent. In this project, as in [1], we take the solvent to be water.
Molecules that have high Solubility (in water) tend to be lipophobic, meaning that
they have low potential to dissolve in lipids, fats and other non-polar solvents. This
means that this measure gives us similar information to the LogD measure, but from
a hydrophilic perspective instead.

Clearance - Clearance is a measure of how fast the substance will be removed from
the patient’s body, giving essential information related to metabolic stability and
dosing of the drug.

2.3 Model Description
The original Transformer model for molecular optimization was trained on a subset
of the MMPs extracted from ChEMBL. Each MMP was represented by the SMILES-strings
of its source and target molecules. The input of the model consisted of the
source SMILES-string concatenated with property-change-tokens, which encode the
desirable changes in the previously mentioned molecular properties LogD,
Solubility and Clearance, while the output consisted of the entire SMILES-string of
the target molecule. To represent these property changes as a finite set of tokens,
each relevant property change was assigned a category represented by a unique token.

Figure 2.2 shows an example of an input and output of the Transformer model in [1].

Input: Source Sequence
  Property Constraint: LogD_Change_(-1.1, -0.9] Solubility_low->high CLlint_high->low
  Source SMILES (Source Molecule, CHEMBL3685029): O=C(Nc1cc2ccc(C3CCCC3)cc2cn1)C1CC1
Output: Target SMILES (Target Molecule, CHEMBL3685059): O=C(Nc1cc2ccc(-c3cncnc3)cc2cn1)C1CC1

Figure 2.2: An example of input and output of the previous model [1].
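
To make the input format concrete, the construction of the source sequence can be sketched as below. The tokenization rule is our assumption (bracketed atoms such as [nH], and the two-letter atoms Br and Cl, are single tokens, and every other character is its own token); the tokenizer actually used in [1] may differ. The names `tokenize_smiles` and `build_source_sequence` are hypothetical.

```python
import re

# Assumed tokenization rule: bracket atoms, Br and Cl are single tokens,
# every other character (atoms, ring digits, bonds, parentheses) is its own token.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|[A-Za-z]|\d|[^A-Za-z\d\[\]]")


def tokenize_smiles(smiles):
    """Split a SMILES string into a list of tokens."""
    return SMILES_TOKEN.findall(smiles)


def build_source_sequence(property_tokens, source_smiles):
    """Property change tokens followed by the tokens of the source SMILES."""
    return list(property_tokens) + tokenize_smiles(source_smiles)


property_tokens = [
    "LogD_Change_(-1.1, -0.9]",
    "Solubility_low->high",
    "CLlint_high->low",
]
source = "O=C(Nc1cc2ccc(C3CCCC3)cc2cn1)C1CC1"
sequence = build_source_sequence(property_tokens, source)
```

Here the example values are taken from Figure 2.2, so `sequence` starts with the three property change tokens and continues with one token per character of the source SMILES.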

With the training set D, the model was trained to minimize the Negative-Log-
Likelihood

\[
\mathrm{NLL}(\theta) = -\sum_{i \in D} \sum_{t=1}^{|y_i|}
\log P(\hat{y}_{i,t} = y_{i,t} \mid y_{i,1}, \ldots, y_{i,t-1}, x_i; \theta), \tag{2.5}
\]

where θ represents the model weights and xi the input sequence i. Here |yi| represents
the target length and yi,t the token at position t for the target i. Note that
ŷi,t represents the generated token at position t, conditioned on the model weights,
the input and the true previous tokens yi,1, ..., yi,t−1. This technique of conditioning
the generation of the next token on the true previous tokens, rather than the previously
generated ones, is known as teacher-forcing [24].
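
The loss in (2.5) can be sketched in a few lines. This is an illustration under assumptions, not the training code of [1]: `model_probs(x, prefix)` is a hypothetical callable that returns the model's probability for every token in the vocabulary given the input x and a prefix of target tokens. Teacher-forcing shows up in the fact that the prefix is always taken from the true target y, never from the model's own output.

```python
import math


def negative_log_likelihood(model_probs, dataset):
    """NLL as in (2.5), summed over all pairs and all target positions.

    dataset: list of (x, y) pairs, where y is a list of target token indices.
    model_probs(x, prefix): hypothetical stand-in for the Transformer,
    returning a list of next-token probabilities.
    """
    total = 0.0
    for x, y in dataset:
        for t in range(len(y)):
            # Teacher-forcing: condition on the true previous tokens y[:t]
            probs = model_probs(x, y[:t])
            total -= math.log(probs[y[t]])
    return total
```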

The Transformer weights were updated through backpropagation using the Adam
optimizer [25] with learning rate 0.0001 and a batch size of 128. Furthermore,
the internal neural networks used a dropout [26] of 10%. After each training epoch,
i.e. one pass through the entire training set, the model was validated using (2.5) with
the previously mentioned validation set to confirm generalization improvement over
the last epoch. The model was trained for a total of 60 epochs, after which it was
compared to two alternative sequence generation models, based on Seq2Seq [27] and
HierG2G [28]. The evaluation was based on letting the models generate 10 target
molecules for each input in the test set, consisting of SMILES sources concatenated
with property change tokens. To obtain a range of generated molecules, the Transformer
used multinomial decode, described in Section 2.1.6, as it was shown to yield similar
performance to the more computationally heavy but typically more accurate beam search.

To estimate the properties of the generated molecules, and thus validate which
molecules satisfy the specified property changes, AstraZeneca’s internal property
prediction models were used. The Transformer model outperformed its alternatives
in terms of how many of the generated molecules satisfied the desired property
changes.

The hyperparameters that were used for the Transformer in [1] are seen in Table
2.1.

Table 2.1: Hyperparameters for the Transformer used in [1]

Parameter  Value  Description
dmodel     256    Dimension of continuous representation for the input and output encoding
dff        2048   Dimension for W1 and W2 in the fully connected neural network in (2.4)
dk         64     Number of rows in Q, K and V for the attention mechanisms
N          8      Number of layers for the Encoder and Decoder
h          6      Number of heads for the attention mechanisms in the encoder and decoder

2.4 Curriculum Learning

The idea with curriculum learning is to “start small”, meaning that in earlier epochs
the model is trained only on an easy subset of the input-output pairs in the training
dataset, and that more difficult data is incorporated in later epochs. This is opposed
to the usual method of training on all available training data in each epoch.

2.4.1 Difficulty Assessment

To define a curriculum, a difficulty score is used. The difficulty of an input-output
pair does not have a unique definition and can be assessed in multiple ways, but is
supposed to reflect how difficult it is for the model to learn from it. One example of
an assessment of the difficulty when training a model for language translation from
one language to another could be to assess the difficulty of an input-output pair as
the number of words in the input sentence [18], meaning that an input-output pair
with input sentence “Hi” would have a difficulty of 1, and if it instead would be “I
am home” it would have a difficulty of 3. The actual number stating the difficulty
here is a bit arbitrary, and it is only the induced order of the input-output pairs that
we need. Another way to assign a difficulty score to each pair is to consider
the frequency of the pair or its parts in the training data. In the case of sentences,
this could mean that the more uncommon words there are in a sentence, the higher its
difficulty score [18].

The difficulty assessments discussed above are what could be called human heuristic
based, meaning that the difficulty is based on what we as humans consider difficult.
Another way of doing this, which could be called machine based, is to decide the
difficulty automatically by looking at which data the model learns easily from and
which it does not, e.g. using the loss induced by the training data. In this project we
only investigate the human heuristic based difficulty assessments.

2.4.2 Curriculum Arrangement

When the difficulty score is set for each input-output pair, the pairs are divided into
N buckets Ci, i ∈ {1, 2, ..., N}. The first bucket C1 only includes a subset of the pairs
that have the lowest difficulty score. Consecutive buckets after this include data that
is more difficult than what was in the previous bucket, and this procedure goes on
until all data has been distributed across the buckets. The number of buckets N in
the curriculum, as well as the number of data points in each bucket, sometimes
follow naturally from the difficulty scores. In the example where the difficulty was
assessed by the length of the input sentence, it is natural to have one bucket for each
available length in the training data, where each bucket would contain all data that
has the corresponding length. In general, this can also be chosen more arbitrarily.

With these buckets, the training will be split up into steps, one step per bucket,
where each step progresses for some number of epochs. In step i, the data used for
training will be all the buckets up to the i:th bucket, or more formally, the data used
in training step i is C1 ∪ C2 ∪ ... ∪ Ci. Figure 2.3 visualizes this split into buckets
and what data is used in each step of the training. Worth noting here is that since
each consecutive step includes more data, one epoch in an early step will most likely
take less time to go through, but since those steps also use less data, the model will
most likely learn less in one epoch and thus need more epochs to converge.
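
The cumulative use of buckets can be sketched as follows. The callable `train_one_epoch` and the list `epochs_per_step` are hypothetical placeholders for the actual training routine and its settings; the sketch only shows that step i trains on C1 ∪ C2 ∪ ... ∪ Ci.

```python
def curriculum_training_sets(buckets):
    """Yield (step, data) where data is the union of buckets 1..step."""
    seen = []
    for step, bucket in enumerate(buckets, start=1):
        seen = seen + list(bucket)
        yield step, list(seen)


def train_with_curriculum(buckets, epochs_per_step, train_one_epoch):
    """Run the given number of epochs in each curriculum step.

    train_one_epoch(data) is a hypothetical callable that performs one
    pass over the given data.
    """
    steps = curriculum_training_sets(buckets)
    for (step, data), n_epochs in zip(steps, epochs_per_step):
        for _ in range(n_epochs):
            train_one_epoch(data)
```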

[Figure 2.3: the buckets C1, C2, C3, ..., CN are arranged along a difficulty axis from
Easy to Hard. Step 1 uses C1 only, Step 2 uses C1 and C2, Step 3 uses C1 to C3, and
Step N uses all buckets.]

Figure 2.3: Visualization of difficulty based buckets in curriculum learning. Easy
data is put into some buckets and difficult data into others. The training is done
by starting training with only the easy buckets, later to incorporate more difficult
ones.

The potential advantages of utilizing curriculum learning are discussed in Section
1.4.1. Based on that discussion, we investigate the two main potential
advantages mentioned there, namely:

• Whether it can result in shorter training times
• Whether it can improve final performance on test data

These advantages will be investigated within molecular optimization using various
manually defined curricula, as we will come back to later.

3 Methods

In this chapter we introduce the experimental settings used to evaluate the models
in later chapters. We will go into detail on how we train and
evaluate the models, including how we create and work with data.

3.1 Platform
The implementations for preprocessing, training, generating and evaluating in the
described experiments are written in Python. Our implementations are largely based
on the ones used to obtain the experimental results in [1]. The Transformer
model is built using PyTorch, based on Torch, with inspiration from [29]. The GPU
used for training the Transformer models was a Tesla V100.

3.2 Data Preparation
In this section we present the data that will be used, including how it is preprocessed
and how the molecular properties are represented for the models.

3.2.1 Preparation of Matched Molecular Pairs
To begin with, we will work with matched molecular pairs (MMPs) extracted from
ChEMBL [30]. This was done using an open-source matched molecular pair tool
[31]. The molecules were then standardized using the molecule validation and stan-
dardization tool MolVS [32]. Note that this constitutes the same molecule pairs as
were used in the original Transformer for molecular optimization [1]. To make the
model work with high-quality and drug-like compounds, the original Transformer
used a subset of the molecular pairs satisfying a number of constraints, which are
stated below:

• The number of heavy atoms, i.e. any atom except hydrogen, of the core is less
than 50
• The number of heavy atoms in the R-group is less than 13
• The ratio of heavy atoms in the R-group to the entire molecule is less than 0.33
• The number of H-bond donors in the R-group is less than 3
• The number of H-bond acceptors in the R-group is less than 3
• AstraZeneca’s AZFilter “CORE” [33] to filter out bad-quality compounds
• Each molecule’s property values are within 3 standard deviations of all molecules’
property values

ChEMBL provides 9,927,876 molecular pairs for which the constraints are satisfied.
Out of these we will use a 2% random selection to limit the training times. Of
this selection we will use 81% for training, 9% for validation and 10% for testing. In
absolute numbers, this corresponds to 160,831 pairs, 17,871 pairs, and 19,856 pairs
respectively.
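
The reported counts can be reproduced with integer arithmetic under one assumption that is ours and is not stated above: that the split was made in two stages, first holding out 10% of the selection for testing and then 10% of the remainder for validation, with the held-out sizes rounded up. Since 0.9 · 0.9 = 0.81 and 0.9 · 0.1 = 0.09, this agrees with the stated 81%, 9% and 10%.

```python
def ceil_div(a, b):
    """Integer division rounded up."""
    return -(-a // b)


total_pairs = 9_927_876
# 2% of all pairs, rounded to the nearest integer
subset = (total_pairs * 2 + 50) // 100

# Assumed two-stage split: 10% for testing, then 10% of the rest for validation
test = ceil_div(subset, 10)
remaining = subset - test
validation = ceil_div(remaining, 10)
train = remaining - validation

print(subset, train, validation, test)  # 198558 160831 17871 19856
```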

3.2.2 Property Representation

For each molecule in a molecular pair, the three ADMET properties LogD, Solubil-
ity and Clearance are calculated using property prediction models developed in [1].
These models are based on message passing neural networks [34] trained on in-house
experimental data. For each MMP, the changes in the molecules’ property values
are calculated and then encoded as discrete property change tokens. Considering
practically desirable criteria and experimental errors, the changes in Solubility
and Clearance are each encoded into one out of three tokens, while the change in LogD
is encoded into one out of 60 tokens representing different value intervals.

Each MMP is assigned the tokens which correspond to the intervals associated with
its property changes. For LogD the token represents the interval within which the
LogD-change lies. For Solubility and Clearance respectively, there are only
three intervals/tokens. If the value (Solubility or Clearance) is higher in the second
molecule of the pair, the pair gets the token “low->high” for that property; if the
value is higher in the first molecule of the pair, the pair gets the token “high->low”.
Since there is uncertainty in the individual molecules’ property
estimates, Solubility and Clearance have fixed thresholds which are used to determine
if the change is significant. For absolute changes lower than the corresponding
property threshold the pair is assigned the token “no change” for that property.
Figure 3.1 shows all the possible property tokens.

LogD Change: LogD_change_(-inf,-5.7], …, LogD_change_(-0.3,-0.1],
LogD_change_(-0.1,0.1], LogD_change_(0.1,0.3], …, LogD_change_(5.9,inf]

Solubility Change: Solubility_low->high, Solubility_high->low, Solubility_no_change
(Threshold for low/high Solubility: Log10(50μM)=1.7)

Clearance Change: CLint_low->high, CLint_high->low, CLint_no_change
(Threshold for low/high Clearance: Log10(20μL/min/mg)=1.3)

Figure 3.1: All encoded property change tokens. Thresholds for high/low Solubility
and Clearance are given in the figure.
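
The LogD encoding can be illustrated with a short sketch. It rests on an assumption of ours: that the tokens not shown in Figure 3.1 follow the same pattern as the visible ones, i.e. right-closed intervals of width 0.2 with boundaries at -5.7, -5.5, ..., 5.9. This pattern yields exactly 60 tokens, which agrees with the number stated above, but the full token list is not given in the text. The names `LOGD_TOKENS` and `logd_change_token` are hypothetical.

```python
from bisect import bisect_left

# Assumed boundaries -5.7, -5.5, ..., 5.9, stored in tenths to avoid float drift
_BOUNDS_TENTHS = list(range(-57, 60, 2))
BOUNDS = [b / 10 for b in _BOUNDS_TENTHS]


def _fmt(tenths):
    return f"{tenths / 10:.1f}"


LOGD_TOKENS = (
    [f"LogD_change_(-inf,{_fmt(_BOUNDS_TENTHS[0])}]"]
    + [f"LogD_change_({_fmt(lo)},{_fmt(hi)}]"
       for lo, hi in zip(_BOUNDS_TENTHS, _BOUNDS_TENTHS[1:])]
    + [f"LogD_change_({_fmt(_BOUNDS_TENTHS[-1])},inf]"]
)


def logd_change_token(delta):
    """Token of the right-closed interval (lo, hi] that contains delta."""
    # bisect_left returns the index of the first boundary >= delta
    return LOGD_TOKENS[bisect_left(BOUNDS, delta)]
```

For example, a LogD change of -1.0 falls in the interval (-1.1, -0.9], the interval used in the example of Figure 2.2.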

3.3 Curriculum Learning
In the following sections, our approach to training models with curriculum learning
is presented. This includes the various types of curriculum
learning settings that are considered, and how the resulting models are trained and
evaluated.

3.3.1 Simulation of Missing Property Values
As stated earlier, each molecule has a value for each of the three properties LogD,
Solubility and Clearance, all of which are estimated using internal property prediction
models. The accuracy of the estimates is not optimal, however, and in practice the
properties will be measured experimentally. In the available experimental
data, it is not always the case that all the properties have been measured, e.g. some
molecules only have LogD measured but not Solubility or Clearance. In the previous
way of training the model, a molecule that is not complete, i.e. is missing at least one
of the property values, cannot be used at all in training, although most of
the information is still there. With this in mind, we start off by masking our data,
meaning that we remove some of the property values for the molecules. This makes
it more similar to the practical scenario. Furthermore, we adapt the implementation
to only include the property change tokens for which the actual property change is
available for a specific MMP.

Note that this procedure of masking would also make sense to do in the other part of
the project, Core-Fixed molecular optimization, but we decided to use the original
data there, to make it easier to compare to the original model. Another advantage
of doing this in the context of curriculum learning is that we get another natural
way of assessing the difficulty, namely by defining the difficulty by the number of
available properties, which we will come back to later.

The molecules were masked according to the following procedure: Firstly, all unique
molecules, among both source and target molecules, were listed. Then 20% of the
molecules got their LogD-value masked, 30% of the molecules got their Solubility-value
masked and 40% of the molecules got their Clearance-value masked. The
choice of which molecules were masked was random, and independent of the masking
of the other properties. Figure 3.2 shows all the different combinations of properties
that a molecule can have after the masking as a Venn-diagram, with areas proportional
to the number of molecules in each set. A molecular pair will then have a certain
property masked if at least one of the molecules it consists of has it masked.
Table 3.1 shows the number of each type of molecule after masking, where each
number in the “Molecules” column corresponds to a closed area in the Venn-diagram.

Since 0.8 of the molecules have a LogD-value and 0.7 of the molecules have a
Solubility-value, and since the masking of a property was independent of the masking
of another property, the proportion of molecules that have both LogD and Solubility
will be approximately 0.8 · 0.7 = 0.56, which is what we see in Table 3.1. And for a
molecule pair to have a LogD-value, both of the molecules that it consists of need
to have it, which they do with a probability of 0.8 · 0.8 = 0.64. This correspondence
is also seen in Table 3.1. With similar reasoning, all the (expected) proportions
shown as percentages in Table 3.1 can be calculated.
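
This reasoning can be written out for all combinations of properties. The sketch below uses only the keep-probabilities 0.8, 0.7 and 0.6 (one minus the masking rates) and the independence of the masking; for pairs it additionally treats the two molecules of a pair as independent, as in the argument above. The function name `expected_fractions` is ours.

```python
from itertools import combinations

# Probability that a molecule keeps each property after masking
keep = {"LogD": 0.8, "Solubility": 0.7, "Clearance": 0.6}


def expected_fractions(keep):
    """Map each property combination to (molecule fraction, pair fraction)."""
    rows = {}
    names = list(keep)
    for r in range(1, len(names) + 1):
        for combo in combinations(names, r):
            p_molecule = 1.0
            for name in combo:
                p_molecule *= keep[name]  # independent masking per property
            # A pair has the combination only if both of its molecules have it
            rows[combo] = (p_molecule, p_molecule ** 2)
    return rows


for combo, (p_mol, p_pair) in expected_fractions(keep).items():
    print(" and ".join(combo), f"{p_mol:.0%}", f"{p_pair:.0%}")
```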

[Figure 3.2: Venn-diagram over all Molecules with three overlapping sets: Known LogD,
Known Solubility and Known Clearance.]

Figure 3.2: Visualization showing how the data was masked to simulate practical
scenario of missing properties.

Table 3.1: The number of molecules and molecular pairs that still have each combination
of properties after the masking. Each row corresponds to one region in the Venn-diagram
shown in Figure 3.2.

                                 Molecules        Molecular Pairs
Total                            248,346 (100%)   198,558 (100%)
LogD                             198,677 (80%)    127,336 (64%)
Solubility                       173,843 (70%)    97,586 (49%)
Clearance                        149,008 (60%)    71,322 (36%)
LogD and Solubility              139,143 (56%)    62,637 (31%)
LogD and Clearance               119,162 (48%)    45,614 (23%)
Solubility and Clearance         104,213 (42%)    34,988 (18%)
LogD, Solubility and Clearance   83,349 (34%)     22,324 (11%)

3.3.2 Difficulty Assessment
An important part of designing a curriculum is the difficulty assessment. In this
project we will explore three different kinds: Property-Based, Length-Based
and Token Rarity-Based. We will also compare these to a trivial, random as-
sessment which we will call Random-Based, and an assessment where we flip the
order of a previous assessment, called Reverse.

Exploratory Difficulty Assessments:
By Exploratory Difficulty Assessments, we mean the following 3 difficulty as-
sessments.

• Property-Based - This difficulty assessment uses the fact that we now
have masked some of our property values (see above) and assigns the difficulty
for an input-output pair as the number of available properties, which is an
integer between 0 and 3. This difficulty assessment is based on the assumption
that pairs with a larger number of available property change tokens are harder
to learn.

• Length-Based - This difficulty assessment assigns the difficulty for an
input-output pair as the number of tokens that the input molecule’s
SMILES-representation consists of, which is an integer between 10 and 77 in our
dataset. This difficulty assessment is based on the hypothesis that longer
sequences are harder to learn.

• Token Rarity-Based - This difficulty assessment assigns the difficulty for
an input-output pair by first listing all SMILES-related tokens, sorted by
frequency in the training set, meaning that the frequent tokens will be first and
rare tokens will be last. An input-output pair will then get its difficulty score
as the position in the sorted list of its rarest token, which is an integer
between 6 and 30 in our dataset. This difficulty assessment is based on the
hypothesis that inputs with rarer tokens are harder to learn.
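
The three exploratory difficulty assessments can be sketched as scoring functions. Several details are assumptions on our part: the SMILES strings are taken to be already tokenized, the rarity score is computed from the source tokens only, positions in the frequency-sorted list start at 1, and ties in frequency are broken alphabetically. All function names are hypothetical.

```python
from collections import Counter


def property_difficulty(available_properties):
    """Property-Based: number of available properties (0 to 3)."""
    return len(available_properties)


def length_difficulty(source_tokens):
    """Length-Based: number of tokens in the source SMILES."""
    return len(source_tokens)


def token_rank_table(training_sources):
    """Rank tokens by frequency in the training set, most frequent first."""
    counts = Counter(tok for tokens in training_sources for tok in tokens)
    # Ties are broken alphabetically (an assumption) to keep the order stable
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return {tok: rank for rank, tok in enumerate(ordered, start=1)}


def rarity_difficulty(source_tokens, rank):
    """Token Rarity-Based: position of the rarest token in the sorted list."""
    return max(rank[tok] for tok in source_tokens)
```

Only the ordering induced by each score matters for the curriculum, so the choice of starting the positions at 0 or 1 does not change the resulting buckets.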

Comparative Difficulty Assessments:
By Comparative Difficulty Assessments, we mean either the Random-Based
difficulty assessment or any assessment defined as a Reverse. In this project, we
only apply Reverse on the previously mentioned Property-based difficulty as-
sessment.

• Random-Based - This difficulty assessment assigns the difficulty for an input-
output pair randomly. This difficulty assessment will only be used for com-
parison purposes.

• Reverse - For each difficulty assessment, one can also define its Reverse
assessment, which flips the list ordered by difficulty upside-down, taking the
easy input-output pairs as difficult, and vice versa. This will also only be used
for comparison purposes. If the Exploratory difficulty assessments are indeed
good ones, then applying Reverse on one of them should result in a bad diffi-
culty assessment.

3.3.3 Curriculum Arrangement
After the difficulty assessment, the choice of how many curriculum learning steps
to use and how much data should be in each step needs to be made. For Property-Based
assessment, the division that is used follows naturally from the difficulty assessment:
4 steps are used, and bucket Ci contains the data that has i − 1 properties available.

For Length-Based and Token Rarity-Based assessment, a natural division would
also be to have one step for each difficulty level, which would be 66 and 24 steps
respectively, as was seen in the previous section. But to make our different methods
more comparable, we use 4 steps here as well. For Length-Based, Token Rarity-Based
and Random-Based assessment, the data is spread approximately uniformly across the
buckets, meaning that all buckets have approximately equally many input-output pairs.
This is done by firstly splitting the pairs, sorted by the score given by the chosen
difficulty assessment, at the three quartiles, and secondly assigning the first of the
four resulting parts to the first bucket, the second part to the second bucket and so on.
The reason for it being only approximately uniform is that the split is not made exactly
at the quartiles, but at the closest place in the list where the difficulty score changes.
For Length- and Token Rarity-Based difficulty, using 8 steps will also be considered
with the same percentile-based data split as for 4 steps. Figure 3.3 shows how the
data is split up based on difficulty scores for three of the difficulty assessments.
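
The percentile-based split can be sketched as below. The sketch follows the description above (cut at the position closest to each quantile where the difficulty score changes), but details such as how ties between two equally close positions are resolved are our assumptions, and the function name `split_into_buckets` is hypothetical.

```python
def split_into_buckets(items, score, n_buckets=4):
    """Split items into difficulty-ordered buckets of roughly equal size.

    Cuts are moved from the exact quantile positions to the closest position
    in the sorted list where the difficulty score changes, so that items with
    the same score always end up in the same bucket.
    """
    ordered = sorted(items, key=score)
    n = len(ordered)
    # Positions p where the score changes between ordered[p - 1] and ordered[p]
    changes = [p for p in range(1, n)
               if score(ordered[p - 1]) != score(ordered[p])]
    cuts = []
    for k in range(1, n_buckets):
        target = round(k * n / n_buckets)
        if changes:
            cut = min(changes, key=lambda p: abs(p - target))
            if cut not in cuts:
                cuts.append(cut)
    cuts.sort()
    edges = [0] + cuts + [n]
    return [ordered[a:b] for a, b in zip(edges, edges[1:])]
```

If there are fewer distinct scores than requested buckets, this sketch returns fewer buckets rather than splitting items with equal scores.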

[Figure 3.3 consists of three histograms of the training data, each divided into the
buckets C1, C2, C3 and C4:
(a) Property-Based: Count against Number of Available Properties
(b) Length-Based: Count against SMILES Length
(c) Token-Based: Count against SMILES Rarest Token]

Figure 3.3: Training data division for Property-, Length- and Token Rarity based
difficulties with 4 buckets. In each graph Ci represents bucket i.

3.3.4 Baseline
For curriculum learning, one baseline will be considered. The baseline will be the
Transformer model described in [1], except that it is trained on the masked data de-
scribed in Section 3.3.1. This will be compared to the curriculum learning approach,
where more data is introduced in steps. The idea with the baseline is to investigate
whether curriculum learning benefits the training.

3.3.5 Training and Validation
Through curriculum learning the training is structured in steps according to the
various difficulties (curricula) presented in Section 3.3.2. For each curriculum learn-
ing step the model will be trained for a predefined number of epochs, which could
be considered as hyperparameters. As for the original Transformer for molecular
optimization we will consider the training through minimization of the Negative-
Log-Likelihood, described in Section 2.2. Note, however, that unlike the original
model, the used training set will depend on the current curriculum learning step.

To measure the models’ improvement over training, a common set of molecular pairs
designated for validation (Section 3.2.1) is used. Among all the 17,871 molecular
pairs intended for validation, only the ones with all three (simulated) properties
available are considered, resulting in a set of 1,984 pairs. This is due to the general
aim of generating molecules based on the three property changes, as was noted in
Section 3.3.2.

For the validation set, the Negative-Log-Likelihood loss and the validation accuracy
are calculated after each epoch. The validation accuracy represents the proportion
of molecules generated from the validation set using greedy decode, described in
Section 2.1.6, that are equal to the (true) validation target molecules. Since the
validation and training sets are expected to possess similar data qualities by
construction, and as the true target molecules fulfill the desirable molecular properties,
these validation measures are reasonable.
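
The validation accuracy can be sketched as an exact-match proportion. Comparing the SMILES strings directly is an assumption on our part: the same molecule can be written as several different SMILES strings, and whether the strings are brought to a canonical form before the comparison is not stated here. The function name is hypothetical.

```python
def validation_accuracy(generated, targets):
    """Proportion of generated molecules equal to their target molecule.

    generated, targets: equally long lists of SMILES strings, where
    generated[i] is the greedy-decoded output for validation pair i.
    """
    if len(generated) != len(targets):
        raise ValueError("generated and targets must have the same length")
    if not targets:
        return 0.0
    matches = sum(1 for g, t in zip(generated, targets) if g == t)
    return matches / len(targets)
```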

For the model- and training hyperparameters we will use the same ones as presented
in Section 2.2 for the original Transformer for molecular optimization. This is done
since it is reasoned that both models work in a similar input- and output space and
could be expected to take similar benefits from the specific model architecture.

3.3.6 Test Sets
To evaluate the models’ performances in different relevant scenarios, we will use
a few different test sets. Since we are interested in the models’ performances on
molecules with three available properties, we will only use the part of the test data
for which the pairs have been simulated to have three properties. This means that
the constructed test sets will only use a subset of the 19,856 pairs designated for
testing purposes described in Section 3.2.1. During evaluation we will use the test
sets below.

Test-Original* - This test set consists of all molecular pairs with three properties
designated for testing, corresponding to a size of 2,229 pairs. For this test set
we will use the source molecules, as well as the property change tokens corresponding
to the target molecules in the test set.

Test-Molecule* - This test set contains all molecular pairs present in Test-Original*
except for the pairs for which the source molecule is seen in the training set. This
construction yields a test size of 1,393 molecular pairs. The purpose of this test set is
to verify how well the models generalize to unseen starting molecules.

Test-Property* - This test set consists of 912 starting molecules with low Solubility,
high Clearance, and LogD between 2 and 4.4 in Test-Original*. For this test
set we are interested in achieving lower LogD, high Solubility and low Clearance.
The desirable LogD for the target molecules is constrained to lie in the interval 1.0
to 3.4, as the experimental data on which the internal prediction model was trained
lies in this range. The purpose of Test-Property* is to evaluate the models on
how well they perform on the particular property changes we are interested in. Note
that this specific property change only corresponds to about 0.2% of the molecular
pairs in the training data.

3.3.7 Evaluation Metrics

For comparing the curriculum learning models and their baseline we will look at a
few evaluation metrics, largely based on the metrics considered in [1]. The metrics
are based on the molecules generated by each model. These come from letting each
model generate up to 10 unique and valid
SMILES molecules for each source molecule in the test sets. To generate 10 unique
and valid SMILES molecules, multinomial decode (Section 2.1.6) is used with up to
100 trials. For the generated molecules, the following measures will be used:

Desirable - This metric gives the proportion of generated molecules that fulfill the
desirable properties. Since the goal of the model is to generate molecules with the
desirable property values specified as input, this is the most important metric. It
gives a number that reflects how good the model is at generating these molecules,
and therefore a high Desirable value is preferred.

MMP33 - This refers to the proportion of generated molecules for which the ratio
between the number of heavy atoms in the R-group and the number of heavy atoms
in the entire molecule is less than 0.33. From Section 3.2.1 it is recognized that
the training data was constructed in such a way that all matched-molecular-pairs
satisfy this. This evaluation metric will give a measure of how well the models have
learned to model the data used during training, and therefore a high MMP33 value
is preferred.

Novel Transformations - This refers to the proportion of generated molecules
which yield a transformation (i.e. specific R-group change) which has not been
seen in the training data. This metric will give an idea of the models’ generalization
performance. Note that the performance interpretation of this metric will depend
on the corresponding Desirable and MMP33 values, e.g. many novel transformations
are not preferable if a model scores poorly in terms of Desirable.
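
The generation procedure and the proportion-type metrics can be sketched as follows. The callables `sample_once` (one multinomial decode for a given source, returning a SMILES string), `is_valid` (a SMILES validity check) and `predicate` (e.g. a check of the desirable properties) are hypothetical placeholders, and we assume that the limit of 100 trials applies per source molecule.

```python
def sample_unique_valid(sample_once, is_valid, n_wanted=10, max_trials=100):
    """Collect up to n_wanted unique, valid molecules in at most max_trials draws."""
    found = []
    seen = set()
    for _ in range(max_trials):
        if len(found) == n_wanted:
            break
        smiles = sample_once()
        if smiles not in seen and is_valid(smiles):
            seen.add(smiles)
            found.append(smiles)
    return found


def proportion(molecules, predicate):
    """Share of molecules satisfying predicate, e.g. Desirable or MMP33."""
    if not molecules:
        return 0.0
    return sum(1 for m in molecules if predicate(m)) / len(molecules)
```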

3.4 Core-Fixed Molecular Optimization
In the following sections, our approach to the Core-Fixed molecular optimization
model is introduced. This includes how the input and output of the underlying
Transformer model are represented, and how the resulting model is trained
and evaluated.

3.4.1 Input and Output Representations
In Core-Fixed molecular optimization we are training a Transformer model to gen-
erate the SMILES of the R-group of the target molecule, rather than the whole
target SMILES as was seen for the standard Transformer model in [1]. For the
Core-Fixed model to distinguish between the core and the R-group of the source
sequence, we will modify the input sequence compared to the standard formulation.
The input sequence for the Core-Fixed model will consist of the property constraint
tokens, the core and the R-group as well as a separator token between the core
and the R-group. This can be compared to the standard model [1], which uses the
property change tokens concatenated with the source SMILES as the input sequence.

The separator token that we use for the Core-Fixed model is the dot symbol (“.”).
The entire input and output chain of the Core-Fixed model is visualized
in Figure 3.4, which, for easy comparison, uses the same example molecule as
the corresponding visualization of the standard formulation in Figure 2.2.
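
The composition of the input sequence can be sketched as follows. The function name `build_core_fixed_input` is hypothetical, the example strings are copied verbatim from Figure 3.4, and the sketch works on coarse segments only; splitting each SMILES string into individual tokens is deliberately left out.

```python
SEPARATOR = "."  # separator token between the core and the R-group


def build_core_fixed_input(property_tokens, core_smiles, r_group_smiles):
    """Concatenate property tokens, core, separator and source R-group.

    Returns a list of coarse segments (an assumption of this sketch);
    SMILES tokenization of the core and the R-group is not performed here.
    """
    return list(property_tokens) + [core_smiles, SEPARATOR, r_group_smiles]


example = build_core_fixed_input(
    ["LogD_Change_(-1.1, -0.9]", "Solubility_low->high", "CLlint_high->low"],
    "[*:1]c1ccc2cc(NC(=O)C3CC3)ncc2c1",
    "[*:1]C1CCCC1",
)
print(" ".join(example))
```

For this example the target sequence would be the single R-group `[*:1]c1cncnc1`, which is merged with the unchanged core to form the target molecule.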

3.4.2 Baselines
For the Core-Fixed model we will consider two baselines:

Baseline - For our main baseline we will consider the model developed by [1] that
was trained to generate the entire target molecules at once, in contrast to the Core-
Fixed formulation.

Enumeration Baseline - This is an exhaustive algorithm that, for a
test molecule, first enumerates the R-groups seen in the training
data and attaches each of them to the molecule’s core, and then selects all R-groups
that yield a molecule with desirable properties. This is shown in Algorithm 1.
Note that the Enumeration Baseline is not strictly applicable for comparison to the
original Transformer model for molecular optimization, since the concept of cores
and R-groups had not been introduced for the original model.

3.4.3 Training and Validation
For the Core-Fixed model we will consider a similar training procedure as for the
original Transformer [1], through minimization of the Negative-Log-Likelihood loss,
described in Section 2.2. Note, however, that the meaning of what is minimized
is different since the target sequence for the Core-Fixed model will represent the
SMILES of the target R-group, as opposed to the SMILES of the entire target
molecule as for the original Transformer for molecular optimization.

[Figure 3.4: diagram of the input and output chain.
Input (source sequence) = Property Constraint + Core SMILES (from source) + Separator + Source R-Group SMILES:
  LogD_Change_(-1.1, -0.9] Solubility_low->high CLlint_high->low [*:1]c1ccc2cc(NC(=O)C3CC3)ncc2c1 . [*:1]C1CCCC1
Model: Transformer
Output (target R-group SMILES): [*:1]c1cncnc1
Source molecule: CHEMBL3685029. Target molecule: CHEMBL3685059.]

Figure 3.4: Input and output chain of the Core-Fixed model. The input consists
of property change tokens, the SMILES of the core, the SMILES of the source R-group
and the dot symbol (“.”) separating the core and R-group representations.
The output consists of a single R-group, which, when merged with the core from
the source molecule, forms the target molecule.

As with curriculum learning, the validation set will be used to measure model im-
provement when training through the Negative-Log-Likelihood loss. For the Core-
Fixed model, the validation accuracy represents the proportion of generated R-
groups for the validation set using greedy decode, described in Section 2.1.6, that
are equal to the (true) validation target R-groups.
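
The validation accuracy described above amounts to an exact-match rate, which can be sketched as below. The function name is hypothetical, and the comparison is done by plain string equality; whether the SMILES strings are canonicalized before comparison is not specified here and is left as an assumption.

```python
def exact_match_accuracy(generated_r_groups, target_r_groups):
    """Proportion of generated R-group SMILES identical to their targets.

    Assumes one greedily decoded R-group per validation pair and compares
    the strings as they are (no canonicalization).
    """
    if len(generated_r_groups) != len(target_r_groups):
        raise ValueError("need exactly one generated R-group per target")
    if not target_r_groups:
        return 0.0
    matches = sum(g == t for g, t in zip(generated_r_groups, target_r_groups))
    return matches / len(target_r_groups)
```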

During training we will also consider other hyperparameter choices than those
used for the original Transformer model described in Section 2.2. Since the output
of the Core-Fixed model represents a different molecular space, i.e. the space of
R-group SMILES as opposed to the space of entire molecules’ SMILES, an optimal
model would not necessarily benefit from using the same hyperparameters. R-group
SMILES are shorter by construction, but the relationships between the SMILES
tokens might also be more complex for a model to learn. Given this, it is hard
to know in advance whether the Core-Fixed model would benefit from
having a more complex or simpler architecture than the original model [1].

Algorithm 1 Exhaustive algorithm used to find which of the known R-groups,
when attached to the core c, yield a molecule that fulfills the desired properties p.

procedure EnumerationBaseline(c, p)
    (c - core SMILES from the test set)
    (p - corresponding desirable properties)
    Let RT = {All unique R-groups in train set}
    Let Vc = ∅                        ▷ Set of R-groups yielding desirable properties
    for r ∈ RT do
        Let M = Concat(c, r)          ▷ Concatenate core and R-group
        if M fulfills p and HeavyAtomsRatio(r, c) < 0.33 then
            Let Vc = Vc ∪ {r}         ▷ Add r to Vc
    return Vc
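
Algorithm 1 can be sketched in Python as follows. The property check and the heavy atom ratio are passed in as callables, since their actual implementations (the property prediction model and the atom counting) are outside the scope of this sketch; the parameter names `fulfills` and `heavy_atoms_ratio` are hypothetical, and the desired properties p are assumed to be captured inside `fulfills`.

```python
def enumeration_baseline(core, train_r_groups, fulfills, heavy_atoms_ratio,
                         max_ratio=0.33):
    """Return the set of known R-groups that pass both checks for `core`.

    `fulfills(molecule)` and `heavy_atoms_ratio(r_group, core)` are
    caller-supplied placeholders, not functions from an existing library.
    """
    valid = set()
    for r_group in set(train_r_groups):    # unique R-groups in the train set
        molecule = (core, r_group)         # stands in for Concat(c, r)
        if fulfills(molecule) and heavy_atoms_ratio(r_group, core) < max_ratio:
            valid.add(r_group)
    return valid
```

Since every R-group is checked independently of the others, the loop can be distributed over several processes without changing the result.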

3.4.4 Test Sets
To evaluate the Core-Fixed model’s performance in different relevant scenarios, we
will use a few different test sets.

Test-Original - This consists of all molecular pairs reserved for testing,
19,856 pairs in total. By construction, Test-Original covers
the same molecular space as the training and validation sets. For
this test set we use the source molecules, together with the property change tokens
corresponding to the target molecules in the test set.

Test-Core - This is a subset of the molecular pairs in Test-Original where we have
excluded molecular pairs for which the core is present in the training set, yielding a
test size of 6,136. When working with the Core-Fixed model, the purpose of Test-
Core will be to see how well the models generalize on unseen cores.

Test-Property - This set consists of 7,813 starting molecules with low Solubility,
high Clearance, and LogD between 2 and 4.4 in Test-Original. For this test set
we are interested in achieving lower LogD, high Solubility and low Clearance. The
desirable LogD for the target molecules is constrained to lie in the interval 1.0 to
3.4, as the experimental data on which the internal property prediction model was
trained lies in this range. The purpose of Test-Property is to evaluate how well
the models perform on the particular property changes we are interested in.
As was noted in the section on curriculum learning, this specific property change is
only used by 0.2% of the molecular pairs in the training data.
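
The two derived test sets can be thought of as filters on Test-Original, sketched below. The dictionary keys (`core`, `solubility`, `clearance`, `logd`) and both function names are hypothetical, and treating the LogD bounds as inclusive is an assumption of this sketch.

```python
def select_test_core(test_pairs, train_cores):
    """Keep the test pairs whose core does not occur in the training set."""
    train_cores = set(train_cores)
    return [pair for pair in test_pairs if pair["core"] not in train_cores]


def is_test_property_source(mol, logd_range=(2.0, 4.4)):
    """Low Solubility, high Clearance and LogD inside the stated range."""
    low, high = logd_range
    return (mol["solubility"] == "low"
            and mol["clearance"] == "high"
            and low <= mol["logd"] <= high)
```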

3.4.5 Evaluation Metrics
The evaluation of the Core-Fixed model and its baselines will be similar to what
was presented for curriculum learning in Section 3.3.7. Besides those evaluation
metrics, i.e. Desirable, Novel Transformations and MMP33, we will in the case of
the Core-Fixed model also consider the following:

Novel R-groups - This metric gives the proportion of generated molecules that
contain R-groups which are not seen among the R-groups in the training data.
This metric will give an idea of models’ generalization performance. As for Novel
Transformations, the performance interpretation will depend on the corresponding
Desirable value.

Unchanged Core - This refers to the proportion of generated molecules that keep
the core specified by the model input. This metric is used to assess how well
the original Transformer for molecular optimization (Baseline) preserves the core
compared to the Core-Fixed model and the Enumeration Baseline, for which the
corresponding proportions are 100% by construction. Here a higher value is preferred.
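
A minimal sketch of the two additional metrics is given below, assuming that the R-group and the core of every generated molecule have already been extracted as strings. How the core of a molecule generated by the Baseline is identified is not covered here, and both function names are hypothetical.

```python
def novel_r_group_rate(generated_r_groups, train_r_groups):
    """Proportion of generated R-groups that never occur in the train set."""
    known = set(train_r_groups)
    if not generated_r_groups:
        return 0.0
    novel = sum(r not in known for r in generated_r_groups)
    return novel / len(generated_r_groups)


def unchanged_core_rate(source_cores, generated_cores):
    """Proportion of generated molecules whose core equals the source core."""
    pairs = list(zip(source_cores, generated_cores))
    if not pairs:
        return 0.0
    return sum(s == g for s, g in pairs) / len(pairs)
```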


4
Results

This chapter goes through the results for the two extensions of the previous Transformer
for molecular optimization: curriculum learning and the Core-Fixed formulation.

4.1 Curriculum Learning
In the following sections the results for the curriculum learning Transformer models
are presented. The section begins by comparing the different difficulty assessments
that were discussed in Section 3.3.2. This is done firstly by looking at the
models’ performance and development during training and validation. After
this, the evaluation metrics are used to compare them in terms of generated
molecules. Finally, the computational time is compared.
Throughout the section, Baseline refers to the Transformer model trained without
the use of curriculum learning as in [1].

4.1.1 Comparison of Exploratory Difficulty Assessments
This section compares models trained using the different difficulty assessments. To
make the results easier to interpret, they have been split up in a part covering the
Exploratory difficulty assessments, which are our main results, and one covering
a comparison of the Property-Based difficulty assessment with the Comparative
difficulty assessments. This part covers the Exploratory difficulty assessments.

4.1.1.1 Training and Validation

In this section, the training and validation loss curves for the Exploratory difficulty
assessments are shown. For the number of epochs to train in each step
of each model, a few different choices were tested and the best one was
chosen. See Appendix A.1 for more information on this.

In Figure 4.1 we see that the training loss is very similar for all models. Looking at
the validation loss we see that the Exploratory models differ from the baseline in
the early epochs, and only later approach it. The reason for this could be that the
Exploratory models only have access to easy subsets of the data during the early
epochs, which makes them perform worse on the validation set, which contains both
easy and difficult data. For all curves we see a slight upward tendency in the later
part of the training, indicating that overfitting is occurring there.


[Figure 4.1(a): Training loss over training time. x-axis: Training Hour (0.0 to 20.0); y-axis: Negative-Log-Likelihood (Train). Curves: Property (20,20,20,120), Token (5,10,15,150), Length (20,20,20,120), Baseline.]

[Figure 4.1(b): Validation loss over training time. x-axis: Training Hour (0.0 to 20.0); y-axis: Negative-Log-Likelihood (Validation). Curves: Property (20,20,20,120), Token (5,10,15,150), Length (20,20,20,120), Baseline.]

Figure 4.1: Training and validation loss over training time for the Exploratory
difficulty assessments in curriculum learning. The numbers in parentheses represent
the number of training epochs for each curriculum learning step.

[Figure 4.2(a): Validation accuracy over training time. x-axis: Training Hour (0.0 to 20.0); y-axis: Validation Accuracy (0% to 5%). Curves: Property (20,20,20,120), Token (5,10,15,150), Length (20,20,20,120), Baseline.]

[Figure 4.2(b): (Smoothed) validation accuracy over epoch. x-axis: Epoch (0 to 175); y-axis: (Smoothed) Validation Accuracy (0% to 4%). Curves: Property (20,20,20,120), Token (5,10,15,150), Length (20,20,20,120), Baseline.]

Figure 4.2: Validation accuracy over training time (a) and smoothed validation
accuracy over epochs (b) for the Exploratory difficulty assessments in curriculum
learning. The validation accuracy is defined in Section 3.3.5. The numbers in
parentheses represent the number of training epochs for each curriculum learning
step.

In Figure 4.2 we see the validation accuracy for the Exploratory difficulty assessments
over (a) training time and (b) epochs. The validation accuracy that a model
reaches can be seen as an indicator of how good the model will be at generating new
molecules with desirable properties, which is the main goal of the model. In Figure
4.2(a) we see that there seems to be no significant difference in the training time
of the models. If anything, Property is slower to converge than the rest, including
the baseline. Similarly, there seems to be no significant difference in the value the
validation accuracy reaches, indicating that the models will perform similarly on the
test sets. In Figure 4.2(b) we see similar information, but in terms of epochs
instead. We see that the Exploratory models need more epochs to converge, which,
as previously mentioned, is not surprising because the early epochs contain less data.

For model selection, i.e. the choice of the epoch to be used for evaluation
on the test set, a trade-off between high smoothed validation accuracy and low
validation loss was considered. Here, the smoothed validation accuracy refers to
the running average of the validation accuracy. Through this trade-off, combined
with subjective judgment, the epochs were chosen to be the following:

• Property: 169
• Token: 153
• Length: 120
• Baseline: 111
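
The smoothing used in the model selection can be illustrated with a short sketch. The running average is interpreted here as a trailing moving average, and the window size of 5 is an arbitrary assumption made for illustration; neither is taken from the implementation used in this project. The final choice of epoch also involved subjective judgment, which the sketch does not attempt to reproduce.

```python
def running_average(values, window=5):
    """Trailing moving average of per-epoch values.

    The first entries are averaged over the epochs available so far.
    The window size is an assumption of this sketch.
    """
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        chunk = values[start:i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed
```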

4.1.1.2 Sample-Based Molecule Generation

In this section the Evaluation Metrics are given for the Exploratory difficulty as-
sessments.

Table 4.1: Comparison of the generation performance of the Exploratory difficulty
assessments (DA) and the corresponding baseline on the three test sets when using
multinomial decode. See Section 3.3.7 for definitions of the evaluation metrics.

Metric        DA                     Test-Original*  Test-Molecule*  Test-Property*
Desirable     CL Transformer Prop.   48.44%          48.42%          35.45%
              CL Transformer Token   50.87%          50.29%          39.12%
              CL Transformer Length  50.41%          51.55%          40.95%
              Baseline               51.81%          51.34%          37.82%
MMP33         CL Transformer Prop.   89.42%          87.29%          89.21%
              CL Transformer Token   90.40%          88.61%          89.83%
              CL Transformer Length  89.71%          88.26%          90.22%
              Baseline               90.15%          88.28%          89.64%
Novel Trans.  CL Transformer Prop.   51.18%          43.56%          57.31%
              CL Transformer Token   49.18%          41.85%          58.28%
              CL Transformer Length  50.90%          43.73%          58.36%
              Baseline               50.79%          43.11%          56.87%

In Table 4.1 we see the Evaluation Metrics for the Exploratory models. Looking
at the Desirable metric, we see that Length performs the best, getting the high-
est percentage on Test-Molecule* and Test-Property*. In the same sense, Property
performs the worst. The differences are small, however, and might not be significant.

Looking at MMP33, which gives the proportion of generated molecules for which the
number of heavy atoms in the R-group is less than 0.33 of the number of heavy atoms
in the entire molecule, we see that Token has the highest values on Test-Original*
and Test-Molecule*, which means that it has learned to model the training data
better. This difference is however also small, meaning it might not be significant.


For Novel Transformations, which is the proportion of generated molecules that
yield a transformation (i.e. specific R-group change) which has not been seen in
the training data, we see that Length gets the highest score on Test-Molecule* and
Test-Property*, which means that Length generates many new molecules and there-
fore might be good at extrapolation. The differences are, however, small here as well.

For Test-Molecule*, we see that the models have similar Desirable and MMP33 values
as for Test-Original*, which indicates that the models perform well on unseen
molecules. The Novel Transformations value is, however, lower on Test-Molecule*, which
shows that the models use more known transformations on unseen molecules.

Finally, for Test-Property* the Desirable values are considerably lower than for the
other two test-sets. Considering the low proportion of molecular pairs that have
these associated property changes, it might not come as a surprise that the models
score worse here.

4.1.2 Comparison of Property-Based with Comparative Difficulty Assessments

In this section we look at how the Property-Based difficulty assessment compares to
the two Comparative difficulty assessments, Random-Based and Reverse (applied to
Property-Based).

4.1.2.1 Training and validation

In this section, the training and validation loss curves for the Property-Based
difficulty assessment and for the Comparative difficulty assessments are shown. For
the number of epochs to train in each step of each model, a few different
choices were tested and the best one was chosen. See Appendix A.1 for
more information on this.

In Figure 4.3 we see similar curves as in the Exploratory case. We see again that
the training loss is very similar for all models and that the validation loss for the
Comparative models differs from the baseline in the early epochs. Again, the reason for
this could be that the Comparative models only have access to easy data at that
point. We also see the same slight upward tendency in the later part of the training,
indicating overfitting.

[Figure 4.3(a): Training loss over training time. x-axis: Training Hour (0.0 to 20.0); y-axis: Negative-Log-Likelihood (Train). Curves: Property (20,20,20,120), Reversed Property (20,20,20,120), Random (20,20,20,120), Baseline.]

[Figure 4.3(b): Validation loss over training time. x-axis: Training Hour (0.0 to 20.0); y-axis: Negative-Log-Likelihood (Validation). Curves: Property (20,20,20,120), Reversed Property (20,20,20,120), Random (20,20,20,120), Baseline.]

Figure 4.3: Training and validation loss over training time for the Comparative
difficulty assessments in curriculum learning. The numbers in parentheses represent
the number of training epochs for each curriculum learning step.

[Figure 4.4(a): Validation accuracy over training time. x-axis: Training Hour (0.0 to 20.0); y-axis: Validation Accuracy (0% to 5%). Curves: Property (20,20,20,120), Reversed Property (20,20,20,120), Random (20,20,20,120), Baseline.]

[Figure 4.4(b): (Smoothed) validation accuracy over epoch. x-axis: Epoch (0 to 175); y-axis: (Smoothed) Validation Accuracy (0% to 4%). Curves: Property (20,20,20,120), Reversed Property (20,20,20,120), Random (20,20,20,120), Baseline.]

Figure 4.4: Validation accuracy over training time (a) and smoothed validation
accuracy over epochs (b) for the Comparative difficulty assessments in curriculum
learning. The validation accuracy is defined in Section 3.3.5. The numbers in
parentheses represent the number of training epochs for each curriculum learning
step.

In Figure 4.4 we see the validation accuracy for the Comparative difficulty assessments
over training time (a) and epochs (b). In Figure 4.4(a) we see, similarly to
the Exploratory case, that there seems to be no significant difference in the training
time of the models. If anything, Property is slower than the rest here as well.
Similarly there seems to be no significant difference in what value the validation
accuracy reaches, indicating that the models will perform the same on the test sets.
In Figure 4.4(b) we see similar information, but we see it in terms of epochs instead.
We see that the Comparative models need more epochs to converge, which
is not surprising as previously mentioned, because the early epochs contain less data.

As in the Exploratory case, for model selection, i.e. the choice of the epoch
to be used for evaluation on the test set, a trade-off between high smoothed validation
accuracy and low validation loss was considered. Through this trade-off, combined
with subjective judgment, the epochs were chosen to be the following:

• Random: 158
• Reverse Property: 149


4.1.2.2 Sample-Based Molecule Generation

Table 4.2: Comparison of the generation performance of the Comparative difficulty
assessments (DA) and the corresponding baseline on the three (masked) test sets.
See Section 3.3.7 for definitions of the evaluation metrics.

Metric        DA                         Test-Original*  Test-Molecule*  Test-Property*
Desirable     CL Transformer Prop.       48.44%          48.42%          35.45%
              CL Transformer Rev. Prop.  50.77%          50.39%          37.01%
              CL Transformer Random      49.23%          48.54%          37.20%
              Baseline                   51.81%          51.34%          37.83%
MMP33         CL Transformer Prop.       89.42%          87.29%          89.21%
              CL Transformer Rev. Prop.  89.49%          87.35%          88.63%
              CL Transformer Random      88.61%          86.44%          88.24%
              Baseline                   90.15%          88.28%          89.64%
Novel Trans.  CL Transformer Prop.       51.18%          43.56%          57.31%
              CL Transformer Rev. Prop.  50.50%          42.63%          57.35%
              CL Transformer Random      50.89%          43.34%          57.63%
              Baseline                   50.79%          43.11%          56.87%

In Table 4.2 we see the evaluation metrics for the Comparative models. Looking
at the Desirable metric we see that the baseline performs the best, getting the highest
percentage on all three test sets. Furthermore, we note that the models using the
Reversed Property- and Random-based difficulty assessments get a higher percentage
on all of the test sets compared to Property.

Next we consider MMP33, which gives the proportion of generated molecules for
which the ratio between the number of heavy atoms in the R-group and the number
of heavy atoms in the entire molecule is less than 0.33. We see that the baseline
has the highest values on all of the three test sets, indicating that the curriculum
learning models have not learned the data as well as the baseline has.

For Novel Transformations, we see that Property-Based gets the highest percentage
on two of the three test sets, which also seems to correlate with it having the
lowest Desirable values.

4.1.3 Computational Time

In Figure 4.5, the training time is presented for each of the models in the previous
sections. The training time is defined as the time required to reach the
epoch at which the respective model has converged according to the previously
introduced criteria (model selection).

[Figure 4.5: bar chart. x-axis: models (Property, Token, Length, Rev. Prop., Random, Baseline); y-axis: Training Hours (0 to 16).]

Figure 4.5: Training hours required for each of the Exploratory and Comparative
models, including the baseline.

In Figure 4.5 it is seen that the training time varies considerably between the models.
Most of the models are slower than the baseline, with only Length being faster.
The Property-based difficulty assessment required the most training time, while the
Length-based one required the least.

4.2 Core-Fixed Molecular Optimization
In the following sections the results for the Core-Fixed model are presented. The
section begins by showing the model’s performance and development on the training
and validation sets. This is followed by results showing the molecule generation
performance on the test sets. If not explicitly stated otherwise, the models use
multinomial decode to generate 10 unique and valid molecules. Furthermore, the
computational time is compared. Throughout the section, Baseline
refers to the Transformer model trained to generate molecules based on their entire
SMILES representation, described in Section 2.2, as opposed to the Core-Fixed
formulation.

4.2.1 Training and Validation
In Figure 4.6(a) the train and validation losses are presented over training epoch
when using the same hyperparameters as for the original formulation [1].
It is seen that the model seems to suffer from overfitting which suggests looking
for other hyperparameters to improve its performance. Appendix B.2 presents the
corresponding losses for other hyperparameter choices.

In Figure 4.6(b) the change in validation accuracy over epochs is shown for the orig-
inal Transformer model for molecular optimization and the Core-Fixed model with
the same hyperparameters.

[Figure 4.6(a): Train and validation loss. x-axis: Epoch (0 to 60); y-axis: Negative-Log-Likelihood. Curves: CF Validation, CF Train, Baseline Validation, Baseline Train.]

[Figure 4.6(b): Validation accuracy. x-axis: Epoch (0 to 60); y-axis: Validation Accuracy (0% to 12%). Curves: CF Transformer, Baseline. Epoch 38 is marked.]

Figure 4.6: (a) Train and validation loss over epochs for the Core-Fixed model. (b)
Validation accuracy over epochs for the original Transformer model for molecular
optimization (Baseline) and the Core-Fixed model with the same hyperparameters.
The validation accuracy is defined in Section 3.4.3.

It is shown that the Core-Fixed model continues to improve in terms of validation
accuracy until approximately epoch 38. Furthermore, Figure 4.6(b) also shows that the
Baseline (original Transformer model for molecular optimization) obtains a significantly
lower validation accuracy. In Appendix B.2 corresponding results are shown
for other hyperparameter choices. Through a trade-off between validation loss and
validation accuracy, the Core-Fixed model with the same hyperparameters as the
original Transformer model for molecular optimization was chosen for testing purposes.
Using the same trade-off between validation loss and validation accuracy, the
number of epochs was chosen to be 38.

4.2.2 Deterministic Molecule Generation
When generating molecules during the validation procedure while training, it was
seen that the Core-Fixed model yielded a significantly higher validation accuracy
than its baseline. With this in mind, it is reasonable to look at how the same
molecule generation procedure would act on the test data. Table 4.3 shows the pro-
portion of generated molecules using greedy decode that fulfill the desirable prop-
erties along with the proportion of generated molecules that are the same as the
corresponding target molecules.

Table 4.3: Comparison of the Core-Fixed model and its Baseline, when using
greedy decode with one generated molecule per starting molecule, on Test-Original.
Top One Desirable refers to the percentage of generated molecules that fulfill the
desired properties while Top One Accuracy represents the percentage of generated
molecules that are the same as their corresponding targets.

                Top One Desirable  Top One Accuracy
CF Transformer  70.19%             12.47%
Baseline        65.21%             5.00%

In Table 4.3 it is seen that the Core-Fixed model has a significantly higher ability to
generate molecules with desirable properties. It is also seen that the proportion of
generated molecules that are the same as the true target molecule is approximately
equal to the validation accuracies (see Figure 4.6(b)), which could be expected since
Test-Original and the validation set are sampled similarly from the original data.

4.2.3 Sample-Based Molecule Generation
This section presents the main results for the Core-Fixed model and how it compares
to its two baselines; the original Transformer for molecular optimization, and the
Enumeration Baseline.

Table 4.4: Comparison of the performance of the Core-Fixed model and its corresponding
baselines on the three test sets when using multinomial decode. See
Sections 3.3.7 and 3.4.5 for definitions of the evaluation metrics.

Metric          Model           Test-Original  Test-Core  Test-Property
Desirable       CF Transformer  58.97%         56.76%     42.90%
                Baseline        56.14%         55.61%     41.75%
                Enum. Baseline  16.93%         18.64%     15.91%
MMP33           CF Transformer  97.67%         97.42%     97.57%
                Baseline        90.45%         86.82%     90.69%
                Enum. Baseline  77.85%         77.93%     81.19%
Novel Trans.    CF Transformer  53.92%         32.37%     57.84%
                Baseline        51.31%         34.76%     57.98%
                Enum. Baseline  96.62%         98.36%     96.65%
Novel R-groups  CF Transformer  4.30%          2.14%      4.66%
                Baseline        3.99%          2.27%      4.25%
                Enum. Baseline  0.00%          0.00%      0.00%
Unchanged Core  CF Transformer  100.00%        100.00%    100.00%
                Baseline        69.10%         44.60%     62.25%
                Enum. Baseline  100.00%        100.00%    100.00%

Table 4.4 presents various statistics of the generated molecules for the three considered
test sets and the three models. Looking at the Desirable metric, it is seen that
the Core-Fixed model outperforms the original Transformer for molecular
optimization by 1-3 percentage points depending on the test set. For the proportion of
heavy atoms in the R-group relative to the entire molecule, measured through
MMP33, Table 4.4 shows a clear advantage in using the Core-Fixed model.

For the proportion of generated molecules that obtain a transformation or R-group
not seen in the training data (Novel R-groups/Transformations), the Core-Fixed
model and the original Transformer for molecular optimization obtain similar
performance, with differences depending on the test set. In Table 4.4, it is
shown that the two Transformer-based models yield a lower proportion of generated
R-groups which are not in the training data for Test-Core than for the other two
test sets. For the Enumeration Baseline no novel R-groups are generated, which
is expected by design of the associated algorithm, while a high proportion of novel
transformations is seen for the three test sets.

One of the main differences between the Core-Fixed model and the original Transformer
for molecular optimization can be seen in Table 4.4 through the proportion
of generated molecules which keep the core from the input molecules. For the Core-Fixed
model and the Enumeration Baseline all generated molecules keep the core by
construction, while for the original Transformer for molecular optimization the
corresponding proportion is considerably lower. This is particularly true for Test-Core,
which contains input molecules for which the core has not been seen in the training data.

Finally, Figure 4.7 shows, for the Transformer-based models, the distributions of the
number of generated molecules that fulfill the desired properties for each starting
molecule. As seen in Figure 4.7 the distributions are similar but
with a slight advantage to the Core-Fixed model in terms of mean for Test-Original
and Test-Core, while the advantage on Test-Property is not as clear. Note that the
differences in distributions seen in Figure 4.7 are statistically significant when using the
k-sample Anderson-Darling test (see Appendix C) at significance level 0.1%, but as
the data sets are large this by itself might not serve as a strong indication of a difference.


(a) (Left) Distributions of number of generated molecules with de-
sirable properties per source molecule in Test-Original, (Right) cor-
responding box plots.

(b) (Left) Distributions of number of generated molecules with de-
sirable properties per source molecule in Test-Core, (Right) corre-
sponding box plots.

(c) (Left) Distributions of number of generated molecules with de-
sirable properties per source molecule in Test-Property, (Right) cor-
responding box plots.

Figure 4.7: Number of generated molecules with desirable properties per source
molecule when using the Core-Fixed model and its associated Baseline.


4.2.4 Novel R-Group Samples

Figure 4.8 shows the top 10 most common novel R-groups generated by the Core-
Fixed model when generating based on Test-Original using multinomial decode.

Figure 4.8: Top 10 most frequent novel R-groups found by the Core-Fixed model
when generating using multinomial decode, on Test-Original.

4.2.5 Example of Baseline’s Inability to Keep the Core

As was seen in Table 4.4, a significant proportion of the generations from the Baseline
failed to keep the core. In Figure 4.9 an example of such a source molecule is
presented.

Figure 4.9: Example of a source molecule (CHEMBL2409118) for which the base-
line failed to keep the core for the 10 generated molecules.

In Figure 4.10, five of the corresponding generated molecules using the Core-Fixed
model and the Baseline respectively, are presented. It is seen that the Core-Fixed
model succeeds in generating molecules where the core, seen in Figure 4.9, is kept.
For the generations using the Baseline, however, it is seen that other parts of the
source molecule are changed.


(a) Generations from the Baseline. The molecule parts surrounded by the red lines
represent the erroneous transformations

(b) Generations from the Core-Fixed model

Figure 4.10: Generations based on the source molecule in Figure 4.9 for which
the Baseline failed to keep the core. Note that all of the shown molecules fulfill the
desirable properties LogD, Solubility and Clearance.

4.2.6 Computational Time
In Table 4.5 the training and generation times for the Core-Fixed model and its
baselines are shown.

Table 4.5: Comparison of run-times for the Core-Fixed model and corresponding baselines.
For Generation, the Test-Original data set consisting of 19,856 molecule pairs
was used. Note that the generation time for the Enumeration Baseline is based on
running the algorithm on 32 CPUs. With more parallelization, a linear speed-up
in the number of CPUs is theoretically achievable.

                      Train             Generation
CF Transformer        7h 37m            7h 14m
Baseline              13h 8m            4h 47m
Enumeration Baseline  (Not Applicable)  32h 23m

It is seen that since the Core-Fixed model requires fewer training epochs, 38 as opposed
to 60, the training time is decreased compared to the original Transformer for
molecular optimization. For generating molecules the Enumeration Baseline is slow
and takes considerably longer to finish generating molecules for Test-Original.
In Table 4.5 it is also seen that the generation time for the Core-Fixed model is longer
than that of the original Transformer for molecular optimization.

5
Discussion

The following sections discuss the results presented in the previous chapter,
along with topics that could be explored beyond the scope of this project.

5.1 Curriculum learning
In this section we discuss the results for curriculum learning concerning training and
validation, and molecule generation. Based on these we suggest topics for future
work.

5.1.1 Training and Validation
During training, we saw that all of the models performed very similarly, both in
terms of training speed and in terms of validation accuracy. This suggests that
curriculum learning, at least in the way it has been used in this project, does
not significantly help our model. Similar conclusions can be drawn from the
evaluation metrics, especially Desirable, which also showed similar values for
all the models.

There are, however, slight differences between the models that are worth
mentioning. When comparing the different difficulty assessments, we saw that
Property had lower validation accuracy than the others in the early epochs. One
possible explanation is that the curriculum arrangement for Property was made
according to the available properties, and not by quartiles, which gives it less
data in step 1 than the other Exploratory models, as was seen in Figure 3.3. We
also saw that the baseline and the Length-based difficulty assessment performed
the best, if only marginally, both in terms of training time and in the Desirable
evaluation metric.
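
To illustrate why an arrangement by available values can leave the first step
with less data than an arrangement by quartiles, consider the following minimal
sketch. The difficulty scores are hypothetical toy data and the function names
are chosen here for illustration only; they do not reproduce the data or the
implementation used in this project. Steps are numbered from 0 in the code, so
step 0 corresponds to step 1 in the text.

```python
from collections import Counter


def steps_by_quartile(scores):
    """Rank examples by difficulty and cut the ranking into four equal parts."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    steps = [0] * len(scores)
    for rank, index in enumerate(order):
        steps[index] = 4 * rank // len(scores)
    return steps


def steps_by_category(scores):
    """Give every distinct difficulty value its own curriculum step."""
    levels = {value: step for step, value in enumerate(sorted(set(scores)))}
    return [levels[score] for score in scores]


# Hypothetical toy data: a discrete difficulty score with a skewed distribution.
toy_scores = [1] * 2 + [2] * 6 + [3] * 12

quartile_sizes = Counter(steps_by_quartile(toy_scores))
category_sizes = Counter(steps_by_category(toy_scores))

if __name__ == "__main__":
    print("examples per step, by quartile:", dict(quartile_sizes))
    print("examples per step, by category:", dict(category_sizes))
```

In this toy example the quartile arrangement places 5 of the 20 examples in the
first step, whereas the category arrangement places only 2 there, since the step
sizes then follow the distribution of the score itself.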

Concerning the Comparative models: as stated earlier in the report, if curriculum
learning had improved the results in any way, we would expect Property to get
good results, Random to get mediocre results and Reverse to get bad results. What
we saw in the validation accuracy was that there is no significant difference in
the accuracy that they converge to, suggesting again that the use of curriculum
learning is not very influential. If anything, both Random and Reverse actually
had slightly higher validation accuracies than Property during the early epochs.
One possible explanation is that the easy data is less informative, having fewer
properties, fewer tokens, or less informative tokens, which means that the model
might not learn as much from it. With this in mind, spending less time training
on easy data, and more time training on difficult data, could reduce the overall
training time. This also suggests that the difficulty assessment was not
beneficial to the model, contrary to what we had expected. For curriculum
learning to work, other difficulty assessments should be tested in the hope of
finding one that does.

The choice of the number of epochs to use for each curriculum learning step is
both important and hard to decide. In this study a quite simple method of trial
and error was used to find a good choice, but better methods may exist. The
choice matters for the training time: for example, if the first curriculum
learning step includes more epochs than the baseline needs to converge, then
curriculum learning will most likely not achieve a lower training time. On the
other hand, if the number of steps is too low, the model will be similar to the
baseline, getting (almost) all the data at once. Similarly, the choice of
convergence epoch is also important and hard to decide, because it also affects
the training time substantially. One suggestion for choosing the number of epochs
in each step is to make it more dynamic by evaluating (in some way) how much the
model has learned in a step, and from that deciding whether it should move on to
the next step.
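
One possible way to realize such a dynamic choice is sketched below. This is an
illustration under stated assumptions, not something that was implemented or
evaluated in this project: the class name, the use of validation accuracy as the
measure of how much the model has learned, and the default values of min_delta,
patience and max_epochs_per_step are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PlateauCurriculum:
    """Move to the next curriculum step once validation accuracy plateaus."""

    num_steps: int
    min_delta: float = 0.001       # smallest gain that counts as progress
    patience: int = 2              # epochs without progress before moving on
    max_epochs_per_step: int = 10  # hard cap so that no step can stall training
    step: int = field(default=0, init=False)
    _best: float = field(default=float("-inf"), init=False)
    _stale: int = field(default=0, init=False)
    _epochs_in_step: int = field(default=0, init=False)

    def update(self, val_accuracy: float) -> int:
        """Record one epoch's validation accuracy; return the step to train on next."""
        self._epochs_in_step += 1
        if val_accuracy > self._best + self.min_delta:
            self._best = val_accuracy
            self._stale = 0
        else:
            self._stale += 1

        plateaued = self._stale >= self.patience
        capped = self._epochs_in_step >= self.max_epochs_per_step
        if (plateaued or capped) and self.step < self.num_steps - 1:
            # Start the next step with fresh counters.
            self.step += 1
            self._best = float("-inf")
            self._stale = 0
            self._epochs_in_step = 0
        return self.step


if __name__ == "__main__":
    curriculum = PlateauCurriculum(num_steps=3, min_delta=0.01)
    # Made-up validation accuracies, one per epoch, for demonstration only.
    for accuracy in [0.50, 0.60, 0.601, 0.602, 0.70, 0.70, 0.70, 0.70]:
        print(accuracy, "-> next step:", curriculum.update(accuracy))
```

In a training loop, update would be called once per epoch with the measured
validation accuracy, and its return value would select the curriculum data for
the following epoch. The cap on epochs per step addresses the concern raised
above that a long first step alone could exceed the time the baseline needs to
converge.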

5.1.2 Molecule Generation
When generating molecules, the Baseline seems to generally yield better perf