Optimization of Molecular Transformers: Influence of tokenization schemes

dc.contributor.author: Tran, Katarina
dc.contributor.department: Chalmers tekniska högskola / Institutionen för fysik
dc.contributor.examiner: Midtvedt, Daniel
dc.contributor.supervisor: Bjerrum, Esben Jannik
dc.contributor.supervisor: Lundh, Torbjörn
dc.date.accessioned: 2022-02-01T12:52:08Z
dc.date.available: 2022-02-01T12:52:08Z
dc.date.issued: 2021
dc.date.submitted: 2020
dc.description.abstract: Synthesis prediction is applied in the 'make' phase of the design-make-test-analyze (DMTA) cycle of drug discovery in order to synthesize small organic molecules. Recently, it was shown that synthesis prediction can be treated as a language translation task using Transformer models. The input reactants and products are expressed in the Simplified Molecular Input Line Entry System (SMILES), a string-based molecular structure format. However, the Transformer model often seems to make errors when outputting SMILES strings that contain many copies of the same character; for example, the most common cases involve chains containing many carbons. Moreover, due to the nature of the attention mechanism, the training time is proportional to n², where n is the input sequence length, and for large databases and long SMILES strings the long training times hamper development. We hypothesised that by splitting up the SMILES strings in a strategic way, the network would train faster and perform better on the differentiation task compared to the single-character baseline. Experiments were conducted with two different tokenization (string segmentation) algorithms: Byte Pair Encoding (BPE) and n-grams. Different tokenization schemes were created from these algorithms by varying the hyperparameters. Shortening the tokenized training sequences through multi-character tokens should also shorten training time significantly, since these tokenization methods compress the input data. The experiments were conducted on two different datasets, a smaller one containing 50,037 chemical reactions and a larger one with 479,035 reactions. Contrary to prior belief, for the smaller dataset the performance decreased when tokenized according to the algorithms compared to the single-character baseline. On the other hand, the performance for the larger dataset was similar to the baseline, while training time decreased by almost 40% for the n-gram tokenization and around 30% for the Byte Pair Encoding tokenization. These experiments show that Byte Pair Encoding and n-gram tokenization did not contribute to an increase in accuracy for the Transformer network on these two datasets. However, these methods might still be applicable when the dataset is large enough, in order to speed up training without affecting accuracy.
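The central idea of the abstract, that multi-character tokens compress SMILES input and thereby reduce the quadratic attention cost, can be illustrated with a minimal Python sketch. This is not the thesis code: the function names, the example molecule, and the greedy fixed-length split (only one possible n-gram scheme; the thesis also considers BPE, which instead merges frequent character pairs learned from data) are illustrative assumptions.

# Illustrative sketch (not the thesis implementation): comparing single-character
# tokenization of a SMILES string with a simple fixed-length n-gram split,
# to show how multi-character tokens shorten the input sequence.

def char_tokenize(smiles: str) -> list:
    """Baseline: one token per character."""
    return list(smiles)

def ngram_tokenize(smiles: str, n: int = 2) -> list:
    """Greedy left-to-right split into non-overlapping n-character chunks;
    the trailing chunk may be shorter than n."""
    return [smiles[i:i + n] for i in range(0, len(smiles), n)]

if __name__ == "__main__":
    # Aspirin as an example SMILES string (24 characters).
    smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"
    chars = char_tokenize(smiles)
    bigrams = ngram_tokenize(smiles, n=2)
    print(len(chars), chars)      # 24 tokens at the character level
    print(len(bigrams), bigrams)  # 12 tokens with 2-grams: half the sequence length

With the 24-character aspirin SMILES, the 2-gram split yields 12 tokens; since attention cost scales roughly with n², halving the sequence length reduces that cost by about a factor of four, which is the mechanism behind the reported training-time savings.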
dc.identifier.coursecode: TIFX05
dc.identifier.uri: https://hdl.handle.net/20.500.12380/304458
dc.language.iso: eng
dc.setspec.uppsok: PhysicsChemistryMaths
dc.subject: drug discovery
dc.subject: neural machine translation
dc.subject: Transformer
dc.subject: training time
dc.subject: data compression
dc.subject: Byte Pair Encoding
dc.subject: n-gram
dc.title: Optimization of Molecular Transformers: Influence of tokenization schemes
dc.type.degree: Examensarbete för masterexamen (Master's thesis)
dc.type.uppsok: H
local.programme: Complex adaptive systems (MPCAS), MSc

Download

Original bundle
Name: Master_Thesis_Katarina_Tran.pdf
Size: 1.33 MB
Format: Adobe Portable Document Format