Contrastive Learning For Molecular Representation
Published
Author
Type
Master's Thesis
Model builder
Journal title
ISSN
Volume title
Publisher
Abstract
This thesis explores the integration of contrastive learning into REINVENT, AstraZeneca’s in-house generative model for molecular design, with the aim of improving the model’s understanding of chemical equivalence between different SMILES representations of the same compound. To this end, a contrastive learning framework was developed, incorporating SMILES-based data augmentation techniques such as enumeration and subgraphing. The framework was evaluated on three datasets: a proprietary baseline derived from ChEMBL35, and the publicly available MOSES and GuacaMol datasets. To assess the impact of architectural design on performance, multiple model architectures were investigated, including a newly introduced intermediate architecture. Results indicate that the intermediate architecture consistently achieves higher validity across all datasets, but tends to reduce novelty. Furthermore, using multiple augmentation strategies improved the model’s ability to generate chemically diverse and novel compounds, as measured by metrics such as novelty and Fréchet ChemNet Distance (FCD). These findings suggest that contrastive learning can offer measurable benefits in de novo molecule generation, although its effectiveness may depend heavily on architecture and dataset-specific tuning.
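The keywords for this thesis list the NT-Xent loss, a common objective for contrastive frameworks of the kind the abstract describes. The following is a minimal sketch of that loss, not the thesis's implementation: it assumes cosine similarity, a temperature of 0.5, and that `views_a[i]` and `views_b[i]` are embeddings of two augmented SMILES views of the same molecule. The function names and the toy vectors are hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent(views_a, views_b, temperature=0.5):
    """Mean NT-Xent loss over 2N anchors.

    views_a[i] and views_b[i] form the positive pair for molecule i;
    every other embedding in the batch acts as a negative.
    The temperature default is an assumption, not a value from the thesis.
    """
    n = len(views_a)
    z = list(views_a) + list(views_b)
    total = 0.0
    for i in range(2 * n):
        j = (i + n) % (2 * n)  # index of the positive partner of anchor i
        pos = cosine(z[i], z[j]) / temperature
        # Denominator runs over all embeddings except the anchor itself.
        logits = [cosine(z[i], z[k]) / temperature
                  for k in range(2 * n) if k != i]
        log_denom = math.log(sum(math.exp(x) for x in logits))
        total += log_denom - pos
    return total / (2 * n)


# Toy embeddings: two "molecules", two views each (hypothetical values).
a = [[1.0, 0.0], [0.0, 1.0]]
b_aligned = [[1.0, 0.1], [0.1, 1.0]]   # views of the same molecule are close
b_swapped = [[0.1, 1.0], [1.0, 0.1]]   # views are paired with the wrong molecule

loss_aligned = nt_xent(a, b_aligned)
loss_swapped = nt_xent(a, b_swapped)
```

In this toy case the aligned pairing gives the lower loss, which is the behaviour the objective rewards: embeddings of different SMILES strings for the same compound are pulled together, while embeddings of other compounds in the batch are pushed apart.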
Description
Subject/keywords
Artificial Intelligence, AI, Deep Learning, DL, Machine Learning, Computer Science, Computer Engineering, Contrastive learning, Self-supervised learning, Representation learning, Data augmentation, Embeddings, Latent space, Generative models, Recurrent neural networks, RNN, Long Short-Term Memory, LSTM, Neural architecture design, Hyperparameter tuning, Transfer learning, Reinforcement learning, NT-Xent loss, Negative log-likelihood, Benchmarking, PCA, t-SNE, UMAP, Drug discovery, De novo molecular generation, In silico screening, Molecular design, Molecular representations, Molecular fingerprints, SMILES, SMILES enumeration, SMILES randomization, Subgraph sampling, Canonicalization, Chemical space, Physicochemical properties, Tanimoto similarity, Fréchet ChemNet Distance, FCD, Synthetic accessibility, SA, Quantitative Estimate of Drug-likeness, QED, Stereoisomers, Stereochemistry, Stereocenters, Tautomerism, Validity, Novelty, Diversity, Internal diversity, IntDiv, ChEMBL35, MOSES benchmark, GuacaMol benchmark, Pharmaceutical AI, Cheminformatics, AstraZeneca, REINVENT, Bioactivity prediction
