Optimization of Molecular Transformers: Influence of tokenization schemes

Master's thesis in Complex Adaptive Systems

KATARINA TRAN

Department of Physics
Chalmers University of Technology
Gothenburg, Sweden 2021
www.chalmers.se

© KATARINA TRAN, 2021.

Supervisor at AstraZeneca: Dr. Esben Jannik Bjerrum
Supervisor at Chalmers: Torbjörn Lundh
Examiner: Daniel Midtvedt

Master's Thesis 2021
Complex Adaptive Systems
Chalmers University of Technology
SE-412 96 Gothenburg
Telephone +46 0705616622

Cover: a high-level architecture of the Transformer.
Typeset in LaTeX, template by Magnus Gustaver
Printed by Chalmers Reproservice
Gothenburg, Sweden 2021

Abstract

Synthesis prediction is applied in the 'make' phase of the design-make-test-analyze (DMTA) cycle of drug discovery in order to synthesize small organic molecules. Recently, it was shown that synthesis prediction can be treated as a language translation task using Transformer models. The input reactants and products are expressed in the Simplified Molecular Input Line Entry System (SMILES), a string-based molecular structure format. However, the Transformer model often seems to make errors when outputting SMILES that contain many copies of the same character; the most common cases involve chains containing many carbons. Moreover, due to the nature of the attention mechanism, training time is proportional to n², where n is the input sequence length, and for large databases and long SMILES strings the long training times hamper development. We hypothesised that by splitting up the SMILES in a strategic way, the network would train faster and perform better on the reaction prediction task compared to a single-character baseline. Experiments with two different tokenization (string segmentation) algorithms, Byte Pair Encoding (BPE) and n-grams, were conducted, and different tokenization schemes were created from these algorithms by varying the hyperparameters. Shortening the tokenized training sequences by using multiple-character tokens should also shorten training time significantly, since these tokenization methods compress the input data. The experiments were conducted on two different datasets: a smaller one containing 50,037 chemical reactions and a larger one with 479,035 reactions. Contrary to expectation, for the smaller dataset the performance decreased when tokenizing according to the algorithms compared to the single-character baseline. On the other hand, the performance for the larger dataset was similar to the baseline, while training time decreased by almost 40% for the n-gram tokenization and around 30% for the Byte Pair Encoding tokenization. These experiments show that Byte Pair Encoding and n-gram tokenization did not contribute to an increase in accuracy for the Transformer network on these two datasets. However, these methods might still be applicable when the data size is large enough, in order to speed up training without affecting accuracy.

Keywords: drug discovery, neural machine translation, Transformer, training time, data compression, Byte Pair Encoding, n-gram.

Acknowledgements
My sincere thanks go to Dr. Esben Jannik Bjerrum, who guided me in this project and who gave me a lot of good advice that I will take with me. I would also like to thank all the helpful staff at AstraZeneca and Chalmers that I have had the chance to meet during my thesis work.

Katarina Tran, Gothenburg, August 2021

Contents

1 Introduction
2 Theory
  2.1 SMILES
  2.2 Architecture of the classic Transformer
    2.2.1 Tokenization and input processing
    2.2.2 Input embedding and positional encoding
    2.2.3 Self-attention in encoder
    2.2.4 Multi-head attention
    2.2.5 Decoder input embedding
    2.2.6 Masked multi-head attention in the decoder
    2.2.7 Encoder-decoder attention
    2.2.8 Linear and Softmax layer
  2.3 Creating the vocabulary
    2.3.1 The look-up vocabulary
    2.3.2 N-gram algorithm
    2.3.3 Byte Pair Encoding algorithm
3 Methods
  3.1 Data sources
  3.2 Creating the vocabulary using BPE algorithm
  3.3 Creating the vocabulary using n-gram algorithm
  3.4 Size of trained vocabularies
  3.5 Hardware and Schedule
4 Results
5 Conclusion
Bibliography
A Appendix 1

1 Introduction

In order to create new drugs with desirable properties (highly potent, effective in in vivo models, metabolically stable and without toxicity issues), the drug discovery process consists of multiple iterations of the design-make-test-analyse (DMTA) cycle. In this process the newly designed molecules are created in laboratories, tested and analysed [1]. One DMTA cycle can take up to 6 weeks [2]; with the use of AI the process can be sped up. When a new drug is designed, the next step is to figure out how to make it: which chemical reactants are needed? Retrosynthesis prediction is the process of predicting the necessary reactant molecules given a product molecule. In the early days, synthesis prediction was carried out by the computer program Logic and Heuristics Applied to Synthetic Analysis (LHASA), a rule-based system developed by Corey in the 1960s [4]. Even though computer-aided synthesis prediction has advanced greatly, the programs that work best are still symbolic rule-based and heuristics-based methods.

AI-assisted retrosynthesis planning can give many possible outcomes. Forward synthesis prediction is carried out in order to ensure that the predicted reactants will react and produce the desired product, and it also suggests the reaction conditions needed [3]. In this project we focus on forward synthesis prediction. Machine translation can be used to predict, or translate, reactants to products. Each molecule is expressed as a Simplified Molecular Input Line Entry System (SMILES) string; SMILES is a set of rules for expressing chemical structures as ASCII strings. A transformer network can take the reactant SMILES strings as input and translate them to product SMILES.
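As a concrete illustration of this framing (the specific reaction below is a generic textbook esterification, not taken from the thesis data sets), a reaction can be written as a reaction SMILES string in which '.' separates molecules and '>>' separates reactants from the product; the source and target of the translation task are then simply the two sides of that string:

```python
# Illustrative sketch of reaction prediction framed as translation.
# "CC(=O)O" (acetic acid) + "CCO" (ethanol) -> "CC(=O)OCC" (ethyl acetate);
# a generic example, not drawn from the thesis data sets.
reaction_smiles = "CC(=O)O.CCO>>CC(=O)OCC"

source, target = reaction_smiles.split(">>")   # source: reactant SMILES, target: product SMILES
reactants = source.split(".")                  # ['CC(=O)O', 'CCO']
```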
Tokenization is the process of splitting up SMILES strings into smaller sub-strings before they enter the neural network. The goal of this project is to test different tokenization schemes, created by two different tokenization algorithms, namely Byte Pair Encoding and n-gram analysis, to see whether they can increase test accuracy for a reaction prediction task and also decrease training time for a transformer network.

2 Theory

2.1 SMILES

Chemical structures are often expressed as molecular graphs, which are easier for humans to visualize. However, in order for a computer to understand and process chemical structures, they need to be converted to SMILES strings, where SMILES stands for Simplified Molecular Input Line Entry System. It is a set of predefined rules on how to represent graph-based chemical structures in ASCII notation. For example, non-organic atoms, charged atoms or organic atoms with unusual valence are written inside square brackets. Single, double, triple and aromatic bonds are written with the notations -, =, # and :. Hydrogen atoms do not need to be written out [9]. The same molecule can be written as different SMILES strings, as can be seen in figure 2.1. For example, all of the following SMILES strings represent the same toluene molecule: Cc1ccccc1, c1ccccc1C and c1(C)ccccc1 [10]. Canonicalization of SMILES strings produces canonical SMILES, which makes sure that each molecule is always expressed in one way and can be used as a reference. However, different toolkits use different algorithms for canonicalization and the generated canonical SMILES may differ between toolkits. In this project RDKit was used.

Figure 2.1: The same molecule written as different SMILES strings.

2.2 Architecture of the classic Transformer

Figure 2.2: A schematic illustration of the transformer. Teacher forcing is applied, where the correct SMILES substrings are fed into the decoder shifted one step to the right.

A transformer is a deep neural network that can handle sequential data, for example natural language texts, time series data or genome sequences [11]. Reaction prediction can be treated as a machine translation task where the SMILES sequences are translated from reactants to products. The classic transformer network consists mainly of encoders and decoders. Each encoder can be broken down into one self-attention layer and one feed-forward layer, while each decoder consists of one self-attention layer, one encoder-decoder attention layer and one feed-forward layer [12].

Figure 2.3: A more detailed view of the transformer.

2.2.1 Tokenization and input processing

Tokenization is the process of splitting up text strings (reactant and product SMILES in our case) into smaller string units before they enter the first encoder and decoder. A vocabulary gives instructions on how to split the SMILES strings. In order for the Transformer to understand and process the SMILES tokens, they need to be converted to numerical values before entering the first encoder. Each SMILES token is converted to its unique index. The mapping between index and token is stored in the vocabulary, which serves as a look-up dictionary with the following structure:

{ 1: 'token string 1',
  2: 'token string 2',
  3: 'token string 3',
  ... }
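To make the look-up concrete, the following is a minimal sketch (with a small, hypothetical vocabulary; not the code used in the thesis) of how a vocabulary can both segment a SMILES string and map the resulting tokens to indices. For brevity the sketch matches the longest vocabulary tokens first, whereas the thesis matches tokens in the order of their vocabulary indices; characters not covered by the vocabulary fall back to single-character tokens.

```python
# Minimal sketch of vocabulary-driven tokenization and index look-up.
# The vocabulary below is hypothetical; a trained vocabulary would come from the
# BPE or n-gram procedures described in section 2.3.
import re

vocab = ["<pad>", "<sos>", "<eos>", "<unk>", "c1ccccc1", "C(=O)", "C", "c", "O", "N", "1", "(", ")", "="]
token_to_index = {token: index for index, token in enumerate(vocab)}
index_to_token = {index: token for token, index in token_to_index.items()}

def tokenize(smiles):
    # Match multi-character vocabulary tokens first (longest first),
    # then fall back to single characters for everything else.
    multi = sorted((t for t in vocab if len(t) > 1 and not t.startswith("<")), key=len, reverse=True)
    pattern = "|".join(re.escape(t) for t in multi) + "|."
    return re.findall(pattern, smiles)

tokens = tokenize("Cc1ccccc1")                                              # ['C', 'c1ccccc1']
indices = [token_to_index.get(t, token_to_index["<unk>"]) for t in tokens]  # [6, 4]
```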
2.2.2 Input embedding and positional encoding

Before entering the first encoder (or the first decoder), the token indices are embedded in a matrix with n rows and d columns, where each row represents the encoding of one token in the sequence and d is the embedding dimension, a hyperparameter to be set. Initially each row contains the SMILES token index padded with zeros. A positional encoding with the same dimensions as the input embedding is added in order to keep track of the token sequence order. The positional encoding (PE) is created by sine and cosine functions with varying frequencies,

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)),

where pos is the position of the current token in the token sequence and i refers to the i-th element of the embedding of size d. This is also illustrated in figure 2.4. The sine function is used for the even embedding elements and the cosine function for the odd ones. Tokens that are close to each other in the sequence get similar (slightly shifted) values in the position vector, while tokens that are positioned far from each other get more dissimilar vector values, as illustrated in figure 2.5 [18].

Figure 2.4: The positional encoding (PE) value on the y-axis is created by sine and cosine functions. Tokens that lie close to each other in sequence order have similar values.

Figure 2.5: The colour bar indicates the positional encoding value for each token sequence position (y-axis) and embedding element (x-axis). The pattern for each position becomes more dissimilar the further apart the positions are [18].

2.2.3 Self-attention in encoder

The self-attention mechanism allows the Transformer to associate a string input with other string inputs that are related to the one being processed. A query, key and value matrix are created by multiplying the input matrix with the respective trained weight matrices, as can be seen in figure 2.6. Each row in the input matrix corresponds to one embedded token, with n being the total number of tokens in the sequence, and the number of columns equals the embedding dimension d.

Figure 2.6: The query (Q), key (K) and value (V) matrices are created by multiplying the input with the different weight matrices.

The next step is the scaled dot-product attention, where the query matrix is multiplied by the transposed key matrix and each value is divided by √d_k, where d_k is the number of columns of the key matrix. The softmax function is then applied, and the resulting matrix is multiplied by the value matrix to get the final attention output (figure 2.7).

Figure 2.7: Visualisation of the scaled dot-product attention producing the final attention output Z [12].

2.2.4 Multi-head attention

Instead of having only one query, key and value matrix producing a single output matrix in the self-attention layer, multi-head attention produces several query, key and value matrices, which in turn give rise to several attention output matrices (figures 2.8 and 2.9) that are concatenated into one single large matrix.

Figure 2.8: Visualisation of the multi-head attention for the first encoder [12]. The input is multiplied with several weight matrices for query, key and value to produce the Q, K and V matrices in each attention head.

Figure 2.9: An illustration of the multi-head attention for the first encoder resulting in several attention outputs Z [12].
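Each attention head applies the scaled dot-product attention described in section 2.2.3. A minimal NumPy sketch of that core computation is given below (illustrative dimensions, not those used in the thesis; the optional mask argument anticipates the masked attention of section 2.2.6):

```python
# Minimal sketch of scaled dot-product attention with illustrative sizes.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q, K, V: (n, d_k) matrices obtained by multiplying the input with W_Q, W_K, W_V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (n, n) token-to-token scores
    if mask is not None:              # mask holds 0 or -inf; softmax turns -inf into 0
        scores = scores + mask
    return softmax(scores) @ V        # attention output Z, shape (n, d_k)

n, d_model, d_k = 5, 16, 8                      # illustrative sizes
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d_model))               # embedded tokens with positional encoding added
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
Z = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
```

In multi-head attention this computation is repeated once per head with separate weight matrices, and the resulting Z matrices are concatenated.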
In order to get the correct output dimension and to average the values of the output matrices, the concatenated matrix is multiplied by a weight matrix (figure 2.10), which is also trained.

Figure 2.10: To get the correct output dimension, the concatenated attention heads are multiplied by an additional weight matrix that is also trained [12].

An illustration of the whole process can be seen in figure 2.11. The input X for the first encoder is the embedded tokens with the positional encoding added. For the remaining encoders the input R comes from the output of the previous encoder.

Figure 2.11: An overview of the multi-head attention steps [12].

2.2.5 Decoder input embedding

The decoder input consists of SMILES token sequences inside the embedding matrix, similar to the encoder embedding. However, when training the network a method called teacher forcing is used. This method feeds in the expected target tokens shifted one step to the right, instead of using the decoder's own previous outputs.

2.2.6 Masked multi-head attention in the decoder

The purpose of masked attention is to prevent each position from attending to token positions ahead of it; each position is only allowed to attend to previous positions up to and including the current one. The input matrix I, created as I = Q × K^T, is masked by adding a matrix I_mask with the same dimensions as I (figure 2.12).

Figure 2.12: Masking by adding -inf at the positions to be masked.

After masking and applying the softmax function we get

I' = Softmax((I + I_mask) / √d_k),

where d_k is the dimension of the key vectors. The masking matrix contains -inf at the positions to be masked and zero otherwise. Applying the softmax function to -inf gives a value of zero at the masked positions.

2.2.7 Encoder-decoder attention

In addition to the masked multi-head attention sublayer, the decoder also contains an encoder-decoder multi-head attention layer. This layer takes its queries from the previous decoder layer, while the keys and values are created from the output of the last encoder.

2.2.8 Linear and Softmax layer

The linear layer converts the decoder output to a logit tensor with the number of elements equal to the trained vocabulary size [12]. Each position in the logit tensor corresponds to a vocabulary token and is given a score in the range (-inf, +inf). The token with the highest score is selected as the next token; this sampling method is called greedy search.

2.3 Creating the vocabulary

2.3.1 The look-up vocabulary

The vocabulary is a dictionary mapping each key, containing a token string, to its respective index. In addition, the vocabulary also contains keys and values for the start- and end-of-sequence tokens. Before entering the first encoder or decoder layer the tokens need to be converted to numerical values; the vocabulary stores the information on how to map tokens to their indices and vice versa. During tokenization the index order of the vocabulary decides how the SMILES strings should be split up: the algorithm screens the data set and splits out segments matching the token with index 1, then does the same for the token with index 2 in the vocabulary, and so on. The characters left over that are not inside the vocabulary are split into single characters.

2.3.2 N-gram algorithm

The n-gram algorithm used in this project was inspired by n-grams and adapted for tokenization. An n-gram consists of consecutive segments of a string with a predefined segment length. As an example, the string 'Cn2ccc3ccccc' forms the following 8-grams: 'Cn2ccc3c', 'n2ccc3cc', '2ccc3ccc', 'ccc3cccc' and 'cc3ccccc', so the number of 8-gram sequences is 5. The same string forms the following 3-grams: 'Cn2', 'n2c', '2cc', 'ccc', 'cc3', 'c3c', '3cc', 'ccc', 'ccc' and 'ccc', giving 10 sequences.
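The extraction itself is a simple sliding window; a small sketch (not the thesis code) that reproduces the example above:

```python
# Sliding-window n-gram extraction: a string of length L yields L - n + 1 n-grams.
def ngrams(s, n):
    return [s[i:i + n] for i in range(len(s) - n + 1)]

print(ngrams("Cn2ccc3ccccc", 8))  # ['Cn2ccc3c', 'n2ccc3cc', '2ccc3ccc', 'ccc3cccc', 'cc3ccccc']
print(ngrams("Cn2ccc3ccccc", 3))  # 10 3-grams; 'ccc' occurs four times
```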
One major advantage of using n-grams is the ability to set the token length, which is not possible with BPE. Shorter n-grams generate longer sequences, which makes training slower; the longest sequences would be generated by tokenization at character level. If the whole string were kept intact, the sequence length would be very short, but important relational information contained inside the string would be missed.

2.3.3 Byte Pair Encoding algorithm

Byte Pair Encoding (BPE) is a data compression algorithm that was introduced by Philip Gage (1994) [5]. The most frequent pair of bytes is replaced by another byte that does not exist in the data, and a table that stores information about the replacements is used to retrieve the original byte pairs. This compression method was further adapted by Sennrich et al. (2016) [6] for use in neural machine translation. Here the words are initially separated at character level, and the most frequently occurring pair of adjacent tokens in the whole text is merged and added to the trained vocabulary; hence common character sequences are grouped together into subword units. The iteration stops when the desired vocabulary size, a parameter to be set, is reached. The training data can then be tokenized according to the trained vocabulary containing all the merged substrings and all single characters that were not merged. The procedure is summarised in the steps below, followed by a small code sketch.

Creating the vocabulary using Byte Pair Encoding
1. Tokenize the SMILES dataset at character level.
2. Initialize the vocabulary with all unique tokens from the dataset.
3. Iteratively count the occurrences of all adjacent token pairs in the tokenized SMILES, merge the most frequently occurring token pair into a new token and add it to the vocabulary. Repeat step 3 until the desired vocabulary size is reached.
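A minimal Python sketch of these steps (a simplified illustration, not the implementation used in the thesis; for example, pair frequencies are counted within each SMILES string and ties are broken arbitrarily):

```python
# Simplified BPE vocabulary training, following steps 1-3 above.
from collections import Counter

def train_bpe_vocabulary(smiles_list, target_vocab_size):
    # Step 1: tokenize every SMILES string at character level.
    corpus = [list(s) for s in smiles_list]
    # Step 2: initialise the vocabulary with all unique single characters.
    vocab = sorted({ch for tokens in corpus for ch in tokens})
    # Step 3: repeatedly merge the most frequent adjacent token pair.
    while len(vocab) < target_vocab_size:
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        (left, right), _ = pair_counts.most_common(1)[0]
        vocab.append(left + right)
        # Re-tokenize the corpus so the merged pair becomes a single token.
        new_corpus = []
        for tokens in corpus:
            merged, i = [], 0
            while i < len(tokens):
                if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                    merged.append(left + right)
                    i += 2
                else:
                    merged.append(tokens[i])
                    i += 1
            new_corpus.append(merged)
        corpus = new_corpus
    return vocab

print(train_bpe_vocabulary(["Cc1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"], target_vocab_size=12))
```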
3 Methods

3.1 Data sources

Two data sets were used. The smaller 'Pande' data set [13] contains 50,037 reactions without reagents and consists of 10 reaction types. The 'MIT mixed' data set [14] contains 479,035 reactions mixed with reagents.

3.2 Creating the vocabulary using BPE algorithm

The steps are listed in section 2.3.3. Five vocabularies with increasing sizes were created for the Pande and the MIT mixed data set, as seen in table 3.1.

3.3 Creating the vocabulary using n-gram algorithm

The first step in creating a vocabulary using the n-gram algorithm is to split up all SMILES strings into n-grams in a sliding-window fashion, where n defines the token length. In this case the iteration started with n = 8, with the purpose of capturing all benzene rings, represented as c1ccccc1 in SMILES notation. The most frequent n-gram was removed from the SMILES strings and added to the trained vocabulary. When a cut-out n-gram was located inside a string, two new substrings emerged and were added to the 'pool'. This process is repeated k times, with k being a parameter to be chosen. In the next iteration the counts of the n-grams in the 'pool' with token length n are also considered. After repeating this process k times, n is set to n − 1; the iteration stops when n = 1. At the end, all unique single characters are also added to the vocabulary.

Creating the vocabulary using n-gram analysis
1. Tokenize the SMILES dataset into n-grams with token length equal to n.
2. Count the frequency of all unique n-grams in the whole data set, including the ones in the 'pool' with token length n (the 'pool' is initially empty).
3. Remove the n-gram with the highest frequency and add it to the vocabulary.
4. If the removed n-gram was taken from the data set (and not from the 'pool') and new SMILES tokens were created in the process, the new tokens are added to the 'pool'.
5. Repeat steps 1-4 k times.
6. Set n = n − 1.
7. Repeat steps 1-6 and stop when n = 1.
8. Add all unique single characters to the vocabulary.

n and k are parameters to be set, and they ultimately decide the vocabulary size, which is shown in table 3.1.

3.4 Size of trained vocabularies

Five different vocabularies with increasing sizes were created with the BPE and the n-gram algorithm. In addition, two baseline vocabularies, containing only the unique single characters that exist in the two data sets, were generated in order to compare the results.

Data set   | Vocabulary sizes (single characters included) | Number of added tokens (single characters not included)
Pande      | 54, 61, 68, 75, 82                            | 7, 14, 21, 28, 35
MIT mixed  | 66, 73, 80, 87, 94                            | 7, 14, 21, 28, 35

Table 3.1: Number of tokens added to the trained vocabularies and the total vocabulary sizes.

3.5 Hardware and Schedule

We used one machine with one NVIDIA P100 GPU. The training was done for 50 epochs with batch size 64. The Adam optimizer was applied and the learning rate was set to 0.0004.

4 Results

For the smaller data set, test accuracy decreased compared to the character baseline when the vocabulary size (or the rate of compression) increased, as can be seen in figure 4.1. There was, however, a small increase in test accuracy for the BPE method with a vocabulary size of 54. Test accuracy for the larger data set was comparable to the character baseline when applying the two tokenization methods, even when the vocabulary size increased.

Figure 4.1: Test accuracies for the two data sets using Byte Pair Encoding and n-gram analysis for tokenization with varying vocabulary size.

Figure 4.2 shows the difference in composition between vocabularies of size 54 created by the two methods on the smaller data set. It can be noticed that the BPE algorithm generated tokens of the same length, while the vocabulary generated by the n-gram method contained tokens of various lengths, due to the nature of the algorithms.

Figure 4.2: The bar plot shows the count of the various token lengths constituting the vocabulary of size 54, created from the smaller data set. Differences in SMILES tokens and their lengths between the BPE and n-gram algorithms are shown in the table.

Figure 4.3: Percentage of training time compared to the character baseline, using training data generated from Byte Pair Encoding and n-gram analysis tokenization with varying vocabulary size.

The bar plot in figure 4.4 shows the mean percentage value of compressed tokens divided by baseline character tokens for the two methods. As the vocabulary size increases, the data gets more compressed. An increase in vocabulary size seems to decrease the sequence length, as shown in figure 4.4; however, the decline in sequence length diminishes as the vocabulary size becomes larger. For example, the change in average percentage of baseline sequence length was around 3-4% going from vocabulary size 87 to 94, compared to a change of around 8-10% when the vocabulary size increased from 66 to 73.

Figure 4.4: The bar plot shows the percentage of average sequence length after compression over the character baseline sequence length.
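As a reading aid, a small sketch of one plausible way to compute the compression metric shown in figure 4.4 (an assumed formulation, not the thesis code):

```python
# Average tokenized sequence length as a percentage of the average
# character-level (baseline) sequence length.
def percent_of_baseline_length(tokenized_sequences, character_sequences):
    mean_tokenized = sum(len(seq) for seq in tokenized_sequences) / len(tokenized_sequences)
    mean_baseline = sum(len(seq) for seq in character_sequences) / len(character_sequences)
    return 100.0 * mean_tokenized / mean_baseline
```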
5 Conclusion

In this project two different tokenization algorithms were tested on two data sources. For the smaller 'Pande' data set [13], test accuracy decreased as vocabulary size increased for both the Byte Pair Encoding and the n-gram method, with the exception of vocabulary size 54. The cause of this deviation is not known and could be investigated further. For the larger 'MIT mixed' data set [14], accuracy was similar to the character baseline for both methods.

When the vocabulary size was increased for the 'MIT mixed' data set, the average percentage of baseline sequence length decreased, which shortens training time. At most, when compression was 59% of the character baseline (for 'MIT mixed'), training time decreased to 62% of baseline training time for the n-gram method and 69% for the Byte Pair Encoding method, without affecting test accuracy significantly. These methods might therefore be suitable for decreasing training time on large data sets.

Bibliography

[1] AstraZeneca iLab: The automated lab of the future. https://www.astrazeneca.com/r-d/our-technologies/ilab.html, October 2021.
[2] Accelerating chemical design and synthesis using artificial intelligence - open workshop. https://www.ri.se/sites/default/files/2020-07/RISE%20Open%20Workshop%202020-05-29%20Conference%20Binder.pdf, May 2020.
[3] Johansson, S., Thakkar, A., Kogej, T., Bjerrum, E., Genheden, S., Bastys, T., Kannas, C., Schliep, A., Chen, H., & Engkvist, O. (2019). AI-assisted synthesis prediction. Drug Discovery Today: Technologies, 32, 65-72.
[4] Pensak, D. A., & Corey, E. J. (1977). LHASA—logic and heuristics applied to synthetic analysis.
[5] Gage, P. (1994). A new algorithm for data compression. C Users Journal, 12(2), 23-38.
[6] Sennrich, R., Haddow, B., & Birch, A. (2015). Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.
[7] Kudo, T., & Richardson, J. (2018). SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226.
[8] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems (pp. 5998-6008).
[9] Weininger, D. (1988). SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of Chemical Information and Computer Sciences, 28(1), 31-36.
[10] Bjerrum, E. J. (2017). SMILES enumeration as data augmentation for neural network modeling of molecules. arXiv preprint arXiv:1703.07076.
[11] Wood, Thomas. Transformer Neural Network. https://deepai.org/machinelearning-glossary-and-terms/transformer-neural-network, October 2021.
[12] Alammar, J. (2018). The Illustrated Transformer [Blog post]. Retrieved from https://jalammar.github.io/illustrated-transformer/
[13] Liu, B., Ramsundar, B., Kawthekar, P., Shi, J., Gomes, J., Luu Nguyen, Q., ... & Pande, V. (2017). Retrosynthetic reaction prediction using neural sequence-to-sequence models. ACS Central Science, 3(10), 1103-1113.
[14] Schwaller, P., Laino, T., Gaudin, T., Bolgar, P., Hunter, C. A., Bekas, C., & Lee, A. A. (2019). Molecular transformer: A model for uncertainty-calibrated chemical reaction prediction. ACS Central Science, 5(9), 1572-1583.
[15] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., ... & Chintala, S. (2019). PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 8026-8037.
[16] Falcon, W., & The PyTorch Lightning team. (2019). PyTorch Lightning (Version 1.4) [Computer software]. https://doi.org/10.5281/zenodo.3828935
[17] Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[18] Kazemnejad, Amirhossein (2019). Transformer Architecture: The Positional Encoding [Blog post]. Retrieved from https://kazemnejad.com/blog/transformer_architecture_positional_encoding/

A Appendix 1

DEPARTMENT OF PHYSICS
CHALMERS UNIVERSITY OF TECHNOLOGY
Gothenburg, Sweden
www.chalmers.se