Emerging Architectures for Chemical Language Modeling
| Metadata field | Value | Language |
|---|---|---|
| dc.contributor.author | Hagström, Ester | |
| dc.contributor.author | Redmo Axelsson, Erik | |
| dc.contributor.department | Chalmers tekniska högskola / Institutionen för data och informationsteknik | sv |
| dc.contributor.department | Chalmers University of Technology / Department of Computer Science and Engineering | en |
| dc.contributor.examiner | Kemp, Graham | |
| dc.contributor.supervisor | Brunnsåker, Daniel | |
| dc.date.accessioned | 2026-02-04T14:10:34Z | |
| dc.date.issued | 2025 | |
| dc.date.submitted | | |
| dc.description.abstract | In recent years, language modeling architectures have become increasingly prominent in generative chemistry, offering new approaches for the de novo design and optimization of small molecules. This thesis presents a comparative study of two emerging architectures, the decoder-only Transformer and the Mamba state space model, against a conventional Recurrent Neural Network (RNN) with LSTM cells. It examines how the choice of training data, a target-focused medicinal dataset (ChEMBL) versus a chemically broad dataset (PubChem), and data augmentation via randomized SMILES representations influence generative capacity and chemical space coverage. In addition, task-specific optimization of the models through reinforcement learning is studied, and the models are compared on their ability to generate diverse molecules with desired properties. Pretraining experiments show that while Mamba and the RNN reach their optimal performance significantly faster, the decoder-only Transformer achieves the highest validity and uniqueness in molecular generation. Training on PubChem, as opposed to ChEMBL, generally enhances validity and uniqueness but tends to reduce novelty, indicating a trade-off between chemical space saturation and innovation. Data augmentation through SMILES randomization helped all models avoid memorizing the dataset, yielding higher novelty across architectures and datasets (see the sketch of this technique and of the evaluation metrics after this record). Reinforcement learning experiments further reveal that all three architectures can optimize toward specific molecular properties, with the decoder-only Transformer and Mamba each exhibiting distinct strengths depending on the optimization task. Regarding the effect of the pretraining condition on reinforcement learning, ChEMBL-pretrained models outperformed PubChem-pretrained ones on multiple tasks, and all architectures, especially Mamba, benefited from pretraining with randomized SMILES. Notably, even reduced-parameter models, such as a downsized decoder-only Transformer variant, performed competitively with larger architectures. | |
| dc.identifier.coursecode | DATX05 | |
| dc.identifier.uri | http://hdl.handle.net/20.500.12380/310959 | |
| dc.language.iso | eng | |
| dc.setspec.uppsok | Technology | |
| dc.subject | Deep learning | |
| dc.subject | chemistry | |
| dc.subject | drug design | |
| dc.subject | mamba | |
| dc.subject | ssm | |
| dc.subject | state space | |
| dc.subject | transformer | |
| dc.subject | LSTM | |
| dc.title | Emerging Architectures for Chemical Language Modeling | |
| dc.type.degree | Examensarbete för masterexamen | sv |
| dc.type.degree | Master's Thesis | en |
| dc.type.uppsok | H | |
| local.programme | Complex adaptive systems (MPCAS), MSc | |
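
The abstract names two concrete techniques: data augmentation via randomized SMILES, and evaluation of generated molecules by validity, uniqueness, and novelty. The sketch below illustrates both under their common definitions in the generative-chemistry literature; it is not the thesis's code, and the function names, metric formulas, and caffeine example are illustrative assumptions. It assumes RDKit is installed.

```python
import random

from rdkit import Chem


def randomize_smiles(smiles, rng):
    """Return a randomized (non-canonical) SMILES for the same molecule, or None.

    A molecule has many valid SMILES spellings; emitting a different one each
    time is the augmentation strategy the abstract refers to.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # input was not a parseable SMILES
        return None
    order = list(range(mol.GetNumAtoms()))
    rng.shuffle(order)  # a random atom numbering changes the output traversal
    mol = Chem.RenumberAtoms(mol, order)
    return Chem.MolToSmiles(mol, canonical=False)


def generation_metrics(generated, training_canonical):
    """Validity, uniqueness, and novelty as commonly defined for SMILES generators."""
    canonical = []
    for s in generated:
        mol = Chem.MolFromSmiles(s)
        if mol is not None:
            canonical.append(Chem.MolToSmiles(mol))  # canonicalize valid outputs
    unique = set(canonical)
    return {
        "validity": len(canonical) / max(len(generated), 1),  # parseable fraction
        "uniqueness": len(unique) / max(len(canonical), 1),   # distinct among valid
        "novelty": len(unique - training_canonical) / max(len(unique), 1),
    }


if __name__ == "__main__":
    rng = random.Random(0)
    caffeine = "CN1C=NC2=C1C(=O)N(C)C(=O)N2C"
    # Two alternative spellings of caffeine, both decoding to the same molecule.
    print([randomize_smiles(caffeine, rng) for _ in range(2)])
    training = {Chem.MolToSmiles(Chem.MolFromSmiles(caffeine))}
    print(generation_metrics([caffeine, "CCO", "not-a-smiles"], training))
```

Shuffling the atom order before writing a non-canonical SMILES is one standard way to obtain alternative spellings of the same molecule; recent RDKit versions also offer an equivalent shortcut via `Chem.MolToSmiles(mol, doRandom=True)`.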
