 

Emerging Architectures for Chemical Language Modeling

dc.contributor.author: Hagström, Ester
dc.contributor.author: Redmo Axelsson, Erik
dc.contributor.department: Chalmers tekniska högskola / Institutionen för data och informationsteknik (sv)
dc.contributor.department: Chalmers University of Technology / Department of Computer Science and Engineering (en)
dc.contributor.examiner: Kemp, Graham
dc.contributor.supervisor: Brunnsåker, Daniel
dc.date.accessioned: 2026-02-04T14:10:34Z
dc.date.issued: 2025
dc.date.submitted:
dc.description.abstract: In recent years, language modeling architectures have become increasingly prominent in the field of generative chemistry, offering new approaches for the de novo design and optimization of small molecules. This thesis presents a comparative study of two emerging architectures, the decoder-only Transformer and the Mamba architecture, alongside a conventional Recurrent Neural Network with LSTM cells. The investigation explores how choices in training data, including a targeted medicinal dataset (ChEMBL) and a chemically broad dataset (PubChem), as well as data augmentation via randomized SMILES representations, influence generative capacity and chemical space coverage. In addition, task-specific optimization of models through reinforcement learning is studied, and the models are compared with respect to their ability to generate diverse molecules with desired properties. Pretraining experiments show that while the Mamba and RNN architectures reach their optimal performance significantly faster, the decoder-only Transformer achieves the highest validity and uniqueness in molecular generation. Training on PubChem, as opposed to ChEMBL, generally enhances validity and uniqueness but tends to reduce novelty, indicating a trade-off between chemical space saturation and innovation. Data augmentation through randomized SMILES helped all models avoid memorizing the dataset, resulting in higher novelty across architectures and datasets. Reinforcement learning experiments further reveal that all three architectures are capable of optimizing toward specific molecular properties, with the decoder-only Transformer and Mamba each exhibiting distinct strengths depending on the optimization task. Regarding the pretraining condition's effect on reinforcement learning, ChEMBL-trained models outperformed those trained on PubChem on multiple tasks, and all architectures, especially Mamba, benefited from being pretrained with randomized SMILES. Notably, even reduced-parameter models, such as a downsized decoder-only Transformer variant, perform competitively relative to larger architectures.
dc.identifier.coursecode: DATX05
dc.identifier.uri: http://hdl.handle.net/20.500.12380/310959
dc.language.iso: eng
dc.setspec.uppsok: Technology
dc.subject: Deep learning
dc.subject: chemistry
dc.subject: drug design
dc.subject: mamba
dc.subject: ssm
dc.subject: state space
dc.subject: transformer
dc.subject: LSTM
dc.title: Emerging Architectures for Chemical Language Modeling
dc.type.degree: Examensarbete för masterexamen (sv)
dc.type.degree: Master's Thesis (en)
dc.type.uppsok: H
local.programme: Complex adaptive systems (MPCAS), MSc

Download

Original bundle

Name: CSE 25-170 EH ERA.pdf
Size: 7.54 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 2.35 KB
Format: Item-specific license agreed upon at submission