Contrastive Learning For Molecular Representation

Type

Master's Thesis

Abstract

This thesis explores the integration of contrastive learning into REINVENT, AstraZeneca’s in-house generative model for molecular design, with the aim of improving the model’s understanding of chemical equivalence between different SMILES representations of the same compound. To this end, a contrastive learning framework was developed, incorporating SMILES-based data augmentation techniques such as enumeration and subgraphing. The framework was evaluated on three datasets: a proprietary baseline derived from ChEMBL35, and the publicly available MOSES and GuacaMol datasets. To assess the impact of architectural design on performance, multiple model architectures were investigated, including a newly introduced intermediate architecture. Results indicate that the intermediate architecture consistently achieves higher validity across all datasets, but tends to reduce novelty. Furthermore, using multiple augmentation strategies improved the model’s ability to generate chemically diverse and novel compounds, as measured by metrics such as novelty and Fréchet ChemNet Distance (FCD). These findings suggest that contrastive learning can offer measurable benefits in de novo molecule generation, although its effectiveness may depend heavily on architecture and dataset-specific tuning.

Subject/keywords

Artificial Intelligence, AI, Deep Learning, DL, Machine Learning, Computer Science, Computer Engineering, Contrastive learning, Self-supervised learning, Representation learning, Data augmentation, Embeddings, Latent space, Generative models, Recurrent neural networks, RNN, Long Short-Term Memory, LSTM, Neural architecture design, Hyperparameter tuning, Transfer learning, Reinforcement learning, NT-Xent loss, Negative log-likelihood, Benchmarking, PCA, T-SNE, UMAP, Drug discovery, De novo molecular generation, In silico screening, Molecular design, Molecular representations, Molecular fingerprints, SMILES, SMILES enumeration, SMILES randomization, Subgraph sampling, Canonicalization, Chemical space, Physicochemical properties, Tanimoto similarity, Fréchet ChemNet Distance, FCD, Synthetic accessibility, SA, Quantitative Estimate of Drug-likeness, QED, Stereoisomers, Stereochemistry, Stereocenters, Tautomerism, Validity, Novelty, Diversity, Internal diversity, IntDiv, ChEMBL35, MOSES benchmark, GuacaMol benchmark, Pharmaceutical AI, Cheminformatics, AstraZeneca, REINVENT, Bioactivity prediction
