Optimization of Molecular Transformers: Influence of tokenization schemes

Type

Master's thesis

Abstract

Synthesis prediction is applied in the 'make' phase of the design-make-test-analyze (DMTA) cycle of drug discovery in order to synthesize small organic molecules. Recently, it was shown that synthesis prediction can be treated as a language translation task using Transformer models. The input reactants and products are expressed in the Simplified Molecular Input Line Entry System (SMILES), a string-based molecular structure format. However, the Transformer model often seems to make errors when outputting SMILES that contain many copies of the same character; for example, the most common cases involve chains containing many carbons. Moreover, due to the nature of the attention mechanism, training time is proportional to n², where n is the input sequence length, and for large databases and long SMILES strings the long training times hamper development. We hypothesised that by splitting up the SMILES strings in a strategic way, the network would train faster and perform better in the differentiation task than the single-character baseline. Experiments were conducted with two different tokenization (string segmentation) algorithms: Byte Pair Encoding (BPE) and n-grams. Different tokenization schemes were created from these algorithms by varying the hyperparameters. Shortening the tokenized training sequences with multi-character tokens should also reduce training time significantly, since these tokenization methods compress the input data. The experiments were conducted on two different datasets: a smaller one containing 50,037 chemical reactions and a larger one with 479,035 reactions. Contrary to prior belief, for the smaller dataset the performance decreased when tokenized according to the algorithms compared to the single-character baseline.
On the other hand, the performance for the larger dataset was similar to the baseline, while training time was decreased by almost 40% for the n-gram tokenization and around 30% for the Byte Pair Encoding tokenization. These experiments show that Byte Pair Encoding and n-gram tokenization did not contribute to an increase in accuracy for the Transformer network on these two datasets. However, these methods might still be applicable when the dataset is large enough, in order to speed up training without affecting accuracy.
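To illustrate why multi-character tokens compress the input, the following is a minimal sketch (an assumption for illustration, not the thesis implementation) of a greedy n-gram tokenizer applied to a repetitive SMILES string; the vocabulary of multi-character tokens is hypothetical, whereas in the thesis it would be derived from the reaction data:

```python
# Contrast the single-character baseline with a greedy n-gram segmentation
# of a SMILES string. Multi-character tokens shorten the sequence, which is
# what cuts training time, since attention cost grows as n^2 in the
# sequence length.

def char_tokenize(smiles: str) -> list[str]:
    """Baseline: one token per character."""
    return list(smiles)

def ngram_tokenize(smiles: str, vocab: set[str], n: int = 3) -> list[str]:
    """Greedily take the longest known n-gram (up to length n) at each
    position; fall back to a single character when nothing matches."""
    tokens, i = [], 0
    while i < len(smiles):
        for size in range(min(n, len(smiles) - i), 0, -1):
            piece = smiles[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens

smiles = "CCCCCCCC(=O)O"       # octanoic acid: a repetitive carbon chain
vocab = {"CC", "CCC"}          # hypothetical multi-character tokens

print(char_tokenize(smiles))          # 13 single-character tokens
print(ngram_tokenize(smiles, vocab))  # 8 tokens: the chain collapses to CCC/CC
```

A BPE tokenizer achieves a similar compression by repeatedly merging the most frequent adjacent token pairs learned from the training corpus, rather than using fixed-length n-grams.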

Subject/keywords

drug discovery, neural machine translation, Transformer, training time, data compression, Byte Pair Encoding, n-gram
