Optimization of Molecular Transformers: Influence of tokenization schemes

dc.contributor.author: Tran, Katarina
dc.contributor.department: Chalmers tekniska högskola / Institutionen för fysik
dc.contributor.examiner: Midtvedt, Daniel
dc.contributor.supervisor: Bjerrum, Esben Jannik
dc.contributor.supervisor: Lundh, Torbjörn
dc.date.accessioned: 2022-02-01T12:52:08Z
dc.date.available: 2022-02-01T12:52:08Z
dc.date.issued: 2021
dc.date.submitted: 2020
dc.description.abstract: Synthesis prediction is applied in the 'make' phase of the design-make-test-analyze (DMTA) cycle of drug discovery in order to synthesize small organic molecules. Recently, it was shown that synthesis prediction can be treated as a language translation task using Transformer models. The input reactants and products are expressed in the Simplified Molecular Input Line Entry System (SMILES), a string-based molecular structure format. However, the Transformer model often seems to make errors when outputting SMILES strings that contain many copies of the same character; for example, the most common cases involve chains containing many carbons. Moreover, due to the nature of the attention mechanism, the training time is proportional to n², where n is the input sequence length, and for large databases and long SMILES strings the long training times hamper development. We hypothesised that by splitting up the SMILES strings in a strategic way, the network would train faster and perform better on the differentiation task compared to the single-character baseline. Experiments were conducted with two different tokenization (string segmentation) algorithms: Byte Pair Encoding (BPE) and n-grams. Different tokenization schemes were created from these algorithms by varying the hyperparameters. Shortening the tokenized training sequences through multi-character tokens should also shorten training time significantly, since these tokenization methods compress the input data. The experiments were conducted on two different datasets, a smaller one containing 50,037 chemical reactions and a larger one with 479,035 reactions. Contrary to prior belief, for the smaller dataset the performance decreased when tokenized according to the algorithms compared to the single-character baseline. On the other hand, the performance for the larger dataset was similar to the baseline, while training time decreased by almost 40% for the n-gram tokenization and around 30% for the Byte Pair Encoding tokenization. These experiments show that Byte Pair Encoding and n-gram tokenization did not contribute to an increase in accuracy for the Transformer network on these two datasets. However, these methods might still be applicable when the dataset is large enough, in order to speed up training without affecting accuracy.
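The central idea of the abstract, that multi-character tokens compress SMILES input and thereby reduce the quadratic attention cost, can be illustrated with a minimal Python sketch. This is not the thesis code: the function names, the example molecule, and the greedy fixed-length split (only one possible n-gram scheme; the thesis also considers BPE, which instead merges frequent character pairs learned from data) are illustrative assumptions.

# Illustrative sketch (not the thesis implementation): comparing single-character
# tokenization of a SMILES string with a simple fixed-length n-gram split,
# to show how multi-character tokens shorten the input sequence.

def char_tokenize(smiles: str) -> list:
    """Baseline: one token per character."""
    return list(smiles)

def ngram_tokenize(smiles: str, n: int = 2) -> list:
    """Greedy left-to-right split into non-overlapping n-character chunks;
    the trailing chunk may be shorter than n."""
    return [smiles[i:i + n] for i in range(0, len(smiles), n)]

if __name__ == "__main__":
    # Aspirin as an example SMILES string (24 characters).
    smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"
    chars = char_tokenize(smiles)
    bigrams = ngram_tokenize(smiles, n=2)
    print(len(chars), chars)      # 24 tokens at the character level
    print(len(bigrams), bigrams)  # 12 tokens with 2-grams: half the sequence length

With the 24-character aspirin SMILES, the 2-gram split yields 12 tokens; since attention cost scales roughly with n², halving the sequence length reduces that cost by about a factor of four, which is the mechanism behind the reported training-time savings.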
dc.identifier.coursecode: TIFX05
dc.identifier.uri: https://hdl.handle.net/20.500.12380/304458
dc.language.iso: eng
dc.setspec.uppsok: PhysicsChemistryMaths
dc.subject: drug discovery
dc.subject: neural machine translation
dc.subject: Transformer
dc.subject: training time
dc.subject: data compression
dc.subject: Byte Pair Encoding
dc.subject: n-gram
dc.title: Optimization of Molecular Transformers: Influence of tokenization schemes
dc.type.degree: Examensarbete för masterexamen (Master's thesis)
dc.type.uppsok: H
local.programme: Complex adaptive systems (MPCAS), MSc

Download

Original bundle
Name: Master_Thesis_Katarina_Tran.pdf
Size: 1.33 MB
Format: Adobe Portable Document Format