Emerging Architectures for Chemical Language Modeling
Master's Thesis
Abstract
In recent years, language modeling architectures have become increasingly prominent
in the field of generative chemistry, offering new approaches for the de novo
design and optimization of small molecules. This thesis presents a comparative
study of two emerging architectures, the decoder-only Transformer and Mamba,
alongside a conventional Recurrent Neural Network (RNN) with LSTM cells. The
investigation explores how choices in training data, including a targeted medicinal
dataset (ChEMBL) and a chemically broad dataset (PubChem), as well as data
augmentation via randomized SMILES representations, influence generative capacity
and chemical space coverage. In addition, task-specific optimization of
models through reinforcement learning is studied, and the models are compared with
respect to their ability to generate diverse molecules with desired properties.
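The augmentation mentioned above amounts to writing each training molecule as several different, equally valid SMILES strings. A minimal sketch of one common way to do this with RDKit's doRandom option is shown below; the function name and the example molecule are illustrative, not taken from the thesis code.

```python
from rdkit import Chem

def randomize_smiles(smiles: str, n: int = 5) -> list[str]:
    """Return up to n randomized (non-canonical) SMILES strings for one molecule."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return []
    # doRandom=True makes RDKit write the atoms in a random order, so the same
    # molecule yields many different, but equally valid, SMILES strings.
    return sorted({Chem.MolToSmiles(mol, canonical=False, doRandom=True) for _ in range(n)})

# Example: several textual views of aspirin that all decode to the same molecule.
print(randomize_smiles("CC(=O)Oc1ccccc1C(=O)O"))
```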
Through pretraining experiments, it is shown that while the Mamba and RNN architectures
reach their optimum performance significantly faster, the decoder-only
Transformer achieves the highest validity and uniqueness in molecular generation.
Training on PubChem, as opposed to ChEMBL, generally enhances validity and
uniqueness but tends to reduce novelty, indicating a trade-off between chemical
space saturation and innovation. Data augmentation through SMILES randomization
helps all models avoid memorizing their training data, resulting in higher novelty
across architectures and datasets.
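Validity, uniqueness, and novelty are reported throughout; a minimal sketch of how these metrics are commonly computed with RDKit follows, assuming the usual definitions (valid = parseable, unique = distinct canonical SMILES among valid samples, novel = absent from the canonicalized training set). The names and exact conventions are assumptions, not the thesis's evaluation code.

```python
from rdkit import Chem

def generation_metrics(sampled: list[str], train_smiles: set[str]) -> dict[str, float]:
    """Validity, uniqueness and novelty of a batch of sampled SMILES."""
    # Validity: fraction of samples that RDKit can parse into a molecule.
    canonical = []
    for smi in sampled:
        mol = Chem.MolFromSmiles(smi)
        if mol is not None:
            canonical.append(Chem.MolToSmiles(mol))  # canonical form for fair comparison
    validity = len(canonical) / len(sampled) if sampled else 0.0

    # Uniqueness: fraction of distinct molecules among the valid samples.
    unique = set(canonical)
    uniqueness = len(unique) / len(canonical) if canonical else 0.0

    # Novelty: fraction of unique molecules not present in the (canonicalized) training set.
    novelty = len(unique - train_smiles) / len(unique) if unique else 0.0
    return {"validity": validity, "uniqueness": uniqueness, "novelty": novelty}
```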
Reinforcement learning experiments further reveal that all three architectures are
capable of optimizing toward specific molecular properties, with the decoder-only
Transformer and Mamba each exhibiting distinct strengths depending on the optimization
task. Regarding the effect of pretraining conditions on reinforcement learning,
ChEMBL-trained models outperform those trained on PubChem on multiple
tasks, and all architectures, but especially Mamba, benefit from being pretrained
with randomized SMILES. Notably, even reduced-parameter models, such
as a downsized decoder-only Transformer variant, perform competitively relative to
larger architectures.
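The abstract does not name the specific reinforcement-learning algorithm used for task-specific optimization. The sketch below shows a generic REINFORCE-style objective of the kind often used to fine-tune SMILES generators toward a property score, written in PyTorch; all names, shapes, and the baseline choice are illustrative assumptions rather than the thesis's method.

```python
import torch

def reinforce_loss(log_probs: torch.Tensor,
                   rewards: torch.Tensor,
                   mask: torch.Tensor) -> torch.Tensor:
    """REINFORCE-style loss for sampled SMILES sequences.

    log_probs: [batch, seq_len] per-token log-probabilities under the current model
    rewards:   [batch]          score of each decoded molecule (e.g. a property predictor)
    mask:      [batch, seq_len] 1 for real tokens, 0 for padding after the end token
    """
    # Log-likelihood of each whole sequence = sum of its token log-probabilities.
    seq_log_prob = (log_probs * mask).sum(dim=1)
    # Subtracting the batch-mean reward acts as a simple variance-reducing baseline.
    advantage = rewards - rewards.mean()
    # Maximizing expected reward <=> minimizing -E[advantage * log pi(sequence)].
    return -(advantage * seq_log_prob).mean()

# Toy call with random tensors standing in for a real generator and scoring function.
fake_log_probs = torch.log(torch.rand(4, 10))
fake_rewards = torch.rand(4)
fake_mask = torch.ones(4, 10)
print(reinforce_loss(fake_log_probs, fake_rewards, fake_mask))
```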
Keywords
Deep learning, chemistry, drug design, Mamba, SSM, state space, Transformer, LSTM
