Emerging Architectures for Chemical Language Modeling
| Metadata field | Value | Language |
|---|---|---|
| dc.contributor.author | Hagström, Ester | |
| dc.contributor.author | Redmo Axelsson, Erik | |
| dc.contributor.department | Chalmers tekniska högskola / Institutionen för data och informationsteknik | sv |
| dc.contributor.department | Chalmers University of Technology / Department of Computer Science and Engineering | en |
| dc.contributor.examiner | Kemp, Graham | |
| dc.contributor.supervisor | Brunnsåker, Daniel | |
| dc.date.accessioned | 2026-02-04T14:10:34Z | |
| dc.date.issued | 2025 | |
| dc.date.submitted | | |
| dc.description.abstract | In recent years, language modeling architectures have become increasingly prominent in generative chemistry, offering new approaches for the de novo design and optimization of small molecules. This thesis presents a comparative study of two emerging architectures, the decoder-only Transformer and the Mamba state space model, against a conventional Recurrent Neural Network (RNN) with LSTM cells. It examines how the choice of training data, a target-focused medicinal dataset (ChEMBL) versus a chemically broad dataset (PubChem), and data augmentation via randomized SMILES representations influence generative capacity and chemical space coverage. In addition, task-specific optimization of the models through reinforcement learning is studied, and the models are compared on their ability to generate diverse molecules with desired properties. Pretraining experiments show that while Mamba and the RNN reach their optimal performance significantly faster, the decoder-only Transformer achieves the highest validity and uniqueness in molecular generation. Training on PubChem, as opposed to ChEMBL, generally enhances validity and uniqueness but tends to reduce novelty, indicating a trade-off between chemical space saturation and innovation. Data augmentation through SMILES randomization helped all models avoid memorizing the dataset, yielding higher novelty across architectures and datasets (see the sketch of this technique and of the evaluation metrics after this record). Reinforcement learning experiments further reveal that all three architectures can optimize toward specific molecular properties, with the decoder-only Transformer and Mamba each exhibiting distinct strengths depending on the optimization task. Regarding the effect of the pretraining condition on reinforcement learning, ChEMBL-pretrained models outperformed PubChem-pretrained ones on multiple tasks, and all architectures, especially Mamba, benefited from pretraining with randomized SMILES. Notably, even reduced-parameter models, such as a downsized decoder-only Transformer variant, performed competitively with larger architectures. | |
| dc.identifier.coursecode | DATX05 | |
| dc.identifier.uri | http://hdl.handle.net/20.500.12380/310959 | |
| dc.language.iso | eng | |
| dc.setspec.uppsok | Technology | |
| dc.subject | Deep learning | |
| dc.subject | chemistry | |
| dc.subject | drug design | |
| dc.subject | mamba | |
| dc.subject | ssm | |
| dc.subject | state space | |
| dc.subject | transformer | |
| dc.subject | LSTM | |
| dc.title | Emerging Architectures for Chemical Language Modeling | |
| dc.type.degree | Examensarbete för masterexamen | sv |
| dc.type.degree | Master's Thesis | en |
| dc.type.uppsok | H | |
| local.programme | Complex adaptive systems (MPCAS), MSc | |
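
The abstract names two concrete techniques: data augmentation via randomized SMILES, and evaluation of generated molecules by validity, uniqueness, and novelty. The sketch below illustrates both under their common definitions in the generative-chemistry literature; it is not the thesis's code, and the function names, metric formulas, and caffeine example are illustrative assumptions. It assumes RDKit is installed.

```python
import random

from rdkit import Chem


def randomize_smiles(smiles, rng):
    """Return a randomized (non-canonical) SMILES for the same molecule, or None.

    A molecule has many valid SMILES spellings; emitting a different one each
    time is the augmentation strategy the abstract refers to.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # input was not a parseable SMILES
        return None
    order = list(range(mol.GetNumAtoms()))
    rng.shuffle(order)  # a random atom numbering changes the output traversal
    mol = Chem.RenumberAtoms(mol, order)
    return Chem.MolToSmiles(mol, canonical=False)


def generation_metrics(generated, training_canonical):
    """Validity, uniqueness, and novelty as commonly defined for SMILES generators."""
    canonical = []
    for s in generated:
        mol = Chem.MolFromSmiles(s)
        if mol is not None:
            canonical.append(Chem.MolToSmiles(mol))  # canonicalize valid outputs
    unique = set(canonical)
    return {
        "validity": len(canonical) / max(len(generated), 1),  # parseable fraction
        "uniqueness": len(unique) / max(len(canonical), 1),   # distinct among valid
        "novelty": len(unique - training_canonical) / max(len(unique), 1),
    }


if __name__ == "__main__":
    rng = random.Random(0)
    caffeine = "CN1C=NC2=C1C(=O)N(C)C(=O)N2C"
    # Two alternative spellings of caffeine, both decoding to the same molecule.
    print([randomize_smiles(caffeine, rng) for _ in range(2)])
    training = {Chem.MolToSmiles(Chem.MolFromSmiles(caffeine))}
    print(generation_metrics([caffeine, "CCO", "not-a-smiles"], training))
```

Shuffling the atom order before writing a non-canonical SMILES is one standard way to obtain alternative spellings of the same molecule; recent RDKit versions also offer an equivalent shortcut via `Chem.MolToSmiles(mol, doRandom=True)`.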
