Contrastive Learning For Molecular Representation

HANI, SALAM; LINDER, JONATHAN

Download

Primary file: CSE 25-134 SH JL.pdf (8.36 MB)

Published

2025

Authors

HANI, SALAM

LINDER, JONATHAN

Type

Master's Thesis

Program

Computer science – algorithms, languages and logic (MPALG), MSc

Abstract

This thesis explores the integration of contrastive learning into REINVENT, AstraZeneca’s in-house generative model for molecular design, with the aim of improving the model’s understanding of chemical equivalence between different SMILES representations of the same compound. To this end, a contrastive learning framework was developed, incorporating SMILES-based data augmentation techniques such as enumeration and subgraphing. The framework was evaluated on three datasets: a proprietary baseline derived from ChEMBL35, and the publicly available MOSES and GuacaMol datasets. To assess the impact of architectural design on performance, multiple model architectures were investigated, including a newly introduced intermediate architecture. Results indicate that the intermediate architecture consistently achieves higher validity across all datasets, but tends to reduce novelty. Furthermore, using multiple augmentation strategies improved the model’s ability to generate chemically diverse and novel compounds, as measured by metrics such as novelty and Fréchet ChemNet Distance (FCD). These findings suggest that contrastive learning can offer measurable benefits in de novo molecule generation, although its effectiveness may depend heavily on architecture and dataset-specific tuning.
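
As a rough illustration of the contrastive objective named in the keywords (NT-Xent loss), the sketch below computes the loss over two views of the same batch, for example embeddings of two enumerated SMILES strings per molecule. It is not the thesis implementation. The function names, the temperature value, and the toy embedding vectors are assumptions made for illustration only, and the record does not describe how REINVENT produces embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent(view_a, view_b, temperature=0.5):
    """NT-Xent loss for two lists of embeddings.

    view_a[i] and view_b[i] are assumed to embed two augmentations of the
    same molecule (a positive pair); every other embedding in the combined
    batch is treated as a negative. The temperature is a hypothetical default.
    """
    n = len(view_a)
    z = list(view_a) + list(view_b)
    total = 0.0
    for i in range(2 * n):
        j = (i + n) % (2 * n)  # index of the positive partner of anchor i
        logits = [cosine(z[i], z[k]) / temperature for k in range(2 * n)]
        # The denominator sums over every embedding except the anchor itself.
        denom = sum(math.exp(logits[k]) for k in range(2 * n) if k != i)
        total += -(logits[j] - math.log(denom))
    return total / (2 * n)


# Toy embeddings (hypothetical): two molecules, two views each.
view_a = [[1.0, 0.0], [0.0, 1.0]]
view_b = [[1.0, 0.1], [0.1, 1.0]]
loss = nt_xent(view_a, view_b)
```

The loss is small when the two views of each molecule lie close together and far from the other molecules in the batch. Pairing each view with the wrong molecule raises it.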

Subject/keywords

Artificial Intelligence, AI, Deep Learning, DL, Machine Learning, Computer Science, Computer Engineering, Contrastive learning, Self-supervised learning, Representation learning, Data augmentation, Embeddings, Latent space, Generative models, Recurrent neural networks, RNN, Long Short-Term Memory, LSTM, Neural architecture design, Hyperparameter tuning, Transfer learning, Reinforcement learning, NT-Xent loss, Negative log-likelihood, Benchmarking, PCA, T-SNE, UMAP, Drug discovery, De novo molecular generation, In silico screening, Molecular design, Molecular representations, Molecular fingerprints, SMILES, SMILES enumeration, SMILES randomization, Subgraph sampling, Canonicalization, Chemical space, Physicochemical properties, Tanimoto similarity, Fréchet ChemNet Distance, FCD, Synthetic accessibility, SA, Quantitative Estimate of Drug-likeness, QED, Stereoisomers, Stereochemistry, Stereocenters, Tautomerism, Validity, Novelty, Diversity, Internal diversity, IntDiv, ChEMBL35, MOSES benchmark, GuacaMol benchmark, Pharmaceutical AI, Cheminformatics, AstraZeneca, REINVENT, Bioactivity prediction

URI

https://hdl.handle.net/20.500.12380/310897

Collections

Master's Theses
