Contrastive Learning For Molecular Representation
| dc.contributor.author | Hani, Salam | |
| dc.contributor.author | Linder, Jonathan | |
| dc.contributor.department | Chalmers tekniska högskola / Institutionen för data och informationsteknik | sv |
| dc.contributor.department | Chalmers University of Technology / Department of Computer Science and Engineering | en |
| dc.contributor.examiner | Bernardy, Jean-Philippe | |
| dc.contributor.supervisor | Olsson, Simon | |
| dc.date.accessioned | 2026-01-16T07:39:21Z | |
| dc.date.issued | 2025 | |
| dc.date.submitted | | |
| dc.description.abstract | This thesis explores the integration of contrastive learning into REINVENT, AstraZeneca’s in-house generative model for molecular design, with the aim of improving the model’s understanding of chemical equivalence between different SMILES representations of the same compound. To this end, a contrastive learning framework was developed, incorporating SMILES-based data augmentation techniques such as enumeration and subgraphing. The framework was evaluated on three datasets: a proprietary baseline derived from ChEMBL35, and the publicly available MOSES and GuacaMol datasets. To assess the impact of architectural design on performance, multiple model architectures were investigated, including a newly introduced intermediate architecture. Results indicate that the intermediate architecture consistently achieves higher validity across all datasets, but tends to reduce novelty. Furthermore, using multiple augmentation strategies improved the model’s ability to generate chemically diverse and novel compounds, as measured by metrics such as novelty and Fréchet ChemNet Distance (FCD). These findings suggest that contrastive learning can offer measurable benefits in de novo molecule generation, although its effectiveness may depend heavily on architecture and dataset-specific tuning. | |
| dc.identifier.coursecode | DATX05 | |
| dc.identifier.uri | http://hdl.handle.net/20.500.12380/310897 | |
| dc.language.iso | eng | |
| dc.setspec.uppsok | Technology | |
| dc.subject | Artificial Intelligence | |
| dc.subject | AI | |
| dc.subject | Deep Learning | |
| dc.subject | DL | |
| dc.subject | Machine Learning | |
| dc.subject | Computer Science | |
| dc.subject | Computer Engineering | |
| dc.subject | Contrastive learning | |
| dc.subject | Self-supervised learning | |
| dc.subject | Representation learning | |
| dc.subject | Data augmentation | |
| dc.subject | Embeddings | |
| dc.subject | Latent space | |
| dc.subject | Generative models | |
| dc.subject | Recurrent neural networks | |
| dc.subject | RNN | |
| dc.subject | Long Short-Term Memory | |
| dc.subject | LSTM | |
| dc.subject | Neural architecture design | |
| dc.subject | Hyperparameter tuning | |
| dc.subject | Transfer learning | |
| dc.subject | Reinforcement learning | |
| dc.subject | NT-Xent loss | |
| dc.subject | Negative log-likelihood | |
| dc.subject | Benchmarking | |
| dc.subject | PCA | |
| dc.subject | t-SNE | |
| dc.subject | UMAP | |
| dc.subject | Drug discovery | |
| dc.subject | De novo molecular generation | |
| dc.subject | In silico screening | |
| dc.subject | Molecular design | |
| dc.subject | Molecular representations | |
| dc.subject | Molecular fingerprints | |
| dc.subject | SMILES | |
| dc.subject | SMILES enumeration | |
| dc.subject | SMILES randomization | |
| dc.subject | Subgraph sampling | |
| dc.subject | Canonicalization | |
| dc.subject | Chemical space | |
| dc.subject | Physicochemical properties | |
| dc.subject | Tanimoto similarity | |
| dc.subject | Fréchet ChemNet Distance | |
| dc.subject | FCD | |
| dc.subject | Synthetic accessibility | |
| dc.subject | SA | |
| dc.subject | Quantitative Estimate of Drug-likeness | |
| dc.subject | QED | |
| dc.subject | Stereoisomers | |
| dc.subject | Stereochemistry | |
| dc.subject | Stereocenters | |
| dc.subject | Tautomerism | |
| dc.subject | Validity | |
| dc.subject | Novelty | |
| dc.subject | Diversity | |
| dc.subject | Internal diversity | |
| dc.subject | IntDiv | |
| dc.subject | ChEMBL35 | |
| dc.subject | MOSES benchmark | |
| dc.subject | GuacaMol benchmark | |
| dc.subject | Pharmaceutical AI | |
| dc.subject | Cheminformatics | |
| dc.subject | AstraZeneca | |
| dc.subject | REINVENT | |
| dc.subject | Bioactivity prediction | |
| dc.title | Contrastive Learning For Molecular Representation | |
| dc.type.degree | Examensarbete för masterexamen | sv |
| dc.type.degree | Master's Thesis | en |
| dc.type.uppsok | H | |
| local.programme | Computer science – algorithms, languages and logic (MPALG), MSc | |
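
The abstract describes training the model so that different SMILES strings of the same compound are treated as equivalent, and the subject terms list the NT-Xent loss. This record does not include the thesis's implementation, so the following is only a minimal sketch of a standard NT-Xent loss under stated assumptions: the embeddings are hypothetical plain Python lists, `view_a[i]` and `view_b[i]` stand for two augmented views (for example, two enumerated SMILES) of the same molecule, and the function names and the temperature value are illustrative rather than taken from the thesis.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent(view_a, view_b, temperature=0.5):
    """Mean NT-Xent loss over a batch of paired embeddings.

    Assumption: view_a[i] and view_b[i] embed two views of the same
    molecule (the positive pair). Every other embedding in the batch
    acts as a negative for that anchor.
    """
    z = list(view_a) + list(view_b)
    n = len(view_a)
    total = 0.0
    for i in range(2 * n):
        j = (i + n) % (2 * n)  # index of the positive partner of anchor i
        positive = cosine(z[i], z[j]) / temperature
        # Similarities to every other embedding, excluding the anchor itself.
        logits = [cosine(z[i], z[k]) / temperature
                  for k in range(2 * n) if k != i]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - positive
    return total / (2 * n)


# Hypothetical 2-D embeddings for two molecules.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]   # each view matches its own anchor
shuffled = [[0.0, 1.0], [1.0, 0.0]]  # views paired with the wrong anchor

loss_aligned = nt_xent(anchors, aligned)
loss_shuffled = nt_xent(anchors, shuffled)
```

In this toy batch the loss is lower when each pair of views points in the same direction than when the views are paired with the wrong anchor, which is the behaviour a contrastive objective is meant to reward.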
