Optimization of Molecular Transformers: Influence of tokenization schemes
Type
Master's thesis
Abstract
Synthesis prediction is applied in the ’make’ phase of the design-make-test-analyze (DMTA) cycle of drug discovery in order to synthesize small organic molecules. Recently, it was shown that synthesis prediction can be treated as a language translation task using Transformer models. The input reactants and products are expressed in the Simplified Molecular Input Line Entry System (SMILES), a string-based molecular structure format. However, the Transformer model often makes errors when outputting SMILES strings that contain many copies of the same character; the most common cases involve chains with many carbon atoms.
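As a concrete illustration (not taken from the thesis; the molecule and names are hypothetical), the Python sketch below character-tokenizes a long-chain SMILES string and shows where such repeated-character runs come from:

    # Character-level tokenization of a SMILES string. Long carbon chains
    # become runs of identical 'C' tokens, the pattern the Transformer
    # reportedly struggles to reproduce in the correct number.
    smiles = "CCCCCCCCCC(=O)O"  # decanoic acid: a ten-carbon chain
    char_tokens = list(smiles)
    print(char_tokens)
    # ['C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', '(', '=', 'O', ')', 'O']
    # Ten indistinguishable 'C' tokens in a row: an off-by-one error in the
    # output yields a different molecule.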
Moreover, due to the nature of the attention mechanism, training time is proportional to n², where n is the input sequence length; for large databases and long SMILES strings, the long training times hamper development (see the sketch below). We hypothesised that splitting up the SMILES strings in a strategic way would let the network train faster and perform better on the differentiation task than the single-character baseline.
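A minimal NumPy sketch of scaled dot-product attention (illustrative shapes and names, not the thesis's implementation) makes the quadratic term visible: the score matrix alone holds n × n entries:

    import numpy as np

    n, d = 128, 64                    # sequence length, model dimension
    Q = np.random.randn(n, d)         # queries, one row per token
    K = np.random.randn(n, d)         # keys
    V = np.random.randn(n, d)         # values

    scores = Q @ K.T / np.sqrt(d)     # shape (n, n): n^2 pairwise scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    out = weights @ V                 # shape (n, d)
    print(scores.shape)               # (128, 128)

Halving the token count with multi-character tokens therefore roughly quarters the work spent on this matrix, which is the intuition behind the expected speed-up.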
Experiments were conducted with two different tokenization (string segmentation) algorithms: Byte Pair Encoding (BPE) and n-grams (a sketch of the BPE merge loop follows this paragraph). Different tokenization schemes were created from these algorithms by varying their hyperparameters. Since both methods compress the input data into multi-character tokens, shortening the tokenized training sequences was also expected to shorten training time significantly. The experiments were conducted on two datasets: a smaller one containing 50,037 chemical reactions and a larger one with 479,035 reactions.
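For context, the core of BPE is a short loop that repeatedly fuses the most frequent adjacent token pair. The sketch below follows the textbook algorithm, not the thesis's implementation; the corpus, merge count, and function names are illustrative:

    # Minimal textbook BPE over character-tokenized SMILES strings.
    from collections import Counter

    def learn_bpe(corpus, num_merges):
        """Return learned merges and the re-tokenized corpus."""
        tokenized = [list(s) for s in corpus]
        merges = []
        for _ in range(num_merges):
            pairs = Counter()
            for toks in tokenized:
                pairs.update(zip(toks, toks[1:]))
            if not pairs:
                break
            best = max(pairs, key=pairs.get)      # most frequent adjacent pair
            merges.append(best)
            merged = best[0] + best[1]
            for i, toks in enumerate(tokenized):  # apply the merge everywhere
                out, j = [], 0
                while j < len(toks):
                    if j + 1 < len(toks) and (toks[j], toks[j + 1]) == best:
                        out.append(merged)
                        j += 2
                    else:
                        out.append(toks[j])
                        j += 1
                tokenized[i] = out
        return merges, tokenized

    corpus = ["CCCCCCCCCC(=O)O", "CCCCCCCC", "CC(=O)OC"]
    merges, toks = learn_bpe(corpus, 3)
    print(merges)   # e.g. [('C', 'C'), ('CC', 'CC'), ...]: chains collapse fast
    print(toks[1])  # the eight-carbon chain is now far fewer tokens

An n-gram tokenizer would instead cut the string into fixed-length character chunks; both approaches trade a larger vocabulary for shorter sequences.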
Contrary to our hypothesis, performance on the smaller dataset decreased when it was tokenized according to these algorithms, compared to the single-character baseline. On the larger dataset, on the other hand, performance was similar to the baseline while training time decreased by almost 40% for the n-gram tokenization and by around 30% for the Byte Pair Encoding tokenization. These experiments show that Byte Pair Encoding and n-gram tokenization did not increase the accuracy of the Transformer network on these two datasets. However, the methods may still be applicable when the dataset is large enough, speeding up training without affecting accuracy.
Keywords
drug discovery, neural machine translation, Transformer, training time, data compression, Byte Pair Encoding, n-gram