Contrastive Learning For Molecular Representation

HANI, SALAM; LINDER, JONATHAN

dc.contributor.author	HANI, SALAM
dc.contributor.author	LINDER, JONATHAN
dc.contributor.department	Chalmers tekniska högskola / Institutionen för data och informationsteknik	sv
dc.contributor.department	Chalmers University of Technology / Department of Computer Science and Engineering	en
dc.contributor.examiner	Bernardy, Jean-Phillippe
dc.contributor.supervisor	Olsson, Simon
dc.date.accessioned	2026-01-16T07:39:21Z
dc.date.issued	2025
dc.date.submitted
dc.description.abstract	This thesis explores the integration of contrastive learning into REINVENT, AstraZeneca’s in-house generative model for molecular design, with the aim of improving the model’s understanding of chemical equivalence between different SMILES representations of the same compound. To this end, a contrastive learning framework was developed, incorporating SMILES-based data augmentation techniques such as enumeration and subgraphing. The framework was evaluated on three datasets: a proprietary baseline derived from ChEMBL35, and the publicly available MOSES and GuacaMol datasets. To assess the impact of architectural design on performance, multiple model architectures were investigated, including a newly introduced intermediate architecture. Results indicate that the intermediate architecture consistently achieves higher validity across all datasets, but tends to reduce novelty. Furthermore, using multiple augmentation strategies improved the model’s ability to generate chemically diverse and novel compounds, as measured by metrics such as novelty and Fréchet ChemNet Distance (FCD). These findings suggest that contrastive learning can offer measurable benefits in de novo molecule generation, although its effectiveness may depend heavily on architecture and dataset-specific tuning.
dc.identifier.coursecode	DATX05
dc.identifier.uri	https://hdl.handle.net/20.500.12380/310897
dc.language.iso	eng
dc.setspec.uppsok	Technology
dc.subject	Artificial Intelligence
dc.subject	AI
dc.subject	Deep Learning
dc.subject	DL
dc.subject	Machine Learning
dc.subject	Computer Science
dc.subject	Computer Engineering
dc.subject	Contrastive learning
dc.subject	Self-supervised learning
dc.subject	Representation learning
dc.subject	Data augmentation
dc.subject	Embeddings
dc.subject	Latent space
dc.subject	Generative models
dc.subject	Recurrent neural networks
dc.subject	RNN
dc.subject	Long Short-Term Memory
dc.subject	LSTM
dc.subject	Neural architecture design
dc.subject	Hyperparameter tuning
dc.subject	Transfer learning
dc.subject	Reinforcement learning
dc.subject	NT-Xent loss
dc.subject	Negative log-likelihood
dc.subject	Benchmarking
dc.subject	PCA
dc.subject	t-SNE
dc.subject	UMAP
dc.subject	Drug discovery
dc.subject	De novo molecular generation
dc.subject	In silico screening
dc.subject	Molecular design
dc.subject	Molecular representations
dc.subject	Molecular fingerprints
dc.subject	SMILES
dc.subject	SMILES enumeration
dc.subject	SMILES randomization
dc.subject	Subgraph sampling
dc.subject	Canonicalization
dc.subject	Chemical space
dc.subject	Physicochemical properties
dc.subject	Tanimoto similarity
dc.subject	Fréchet ChemNet Distance
dc.subject	FCD
dc.subject	Synthetic accessibility
dc.subject	SA
dc.subject	Quantitative Estimate of Drug-likeness
dc.subject	QED
dc.subject	Stereoisomers
dc.subject	Stereochemistry
dc.subject	Stereocenters
dc.subject	Tautomerism
dc.subject	Validity
dc.subject	Novelty
dc.subject	Diversity
dc.subject	Internal diversity
dc.subject	IntDiv
dc.subject	ChEMBL35
dc.subject	MOSES benchmark
dc.subject	GuacaMol benchmark
dc.subject	Pharmaceutical AI
dc.subject	Cheminformatics
dc.subject	AstraZeneca
dc.subject	REINVENT
dc.subject	Bioactivity prediction
dc.title	Contrastive Learning For Molecular Representation
dc.type.degree	Examensarbete för masterexamen	sv
dc.type.degree	Master's Thesis	en
dc.type.uppsok	H
local.programme	Computer science – algorithms, languages and logic (MPALG), MSc
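
The abstract and keywords name contrastive learning with the NT-Xent loss, where two augmented views of the same molecule (for example, two enumerated SMILES strings) should receive similar embeddings. The record does not give the thesis's exact formulation, so the following is only a minimal sketch of the standard NT-Xent loss in plain Python. The function names `cosine` and `nt_xent`, the toy two-dimensional embeddings, and the temperature of 0.5 are illustrative assumptions, not values taken from the thesis.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent(view_a, view_b, temperature=0.5):
    """Standard NT-Xent loss (hypothetical helper, not the thesis code).

    view_a[i] and view_b[i] are assumed to be embeddings of two augmented
    views of the same molecule; all other embeddings in the batch act as
    negatives for that pair.
    """
    z = list(view_a) + list(view_b)
    n = len(view_a)
    total = 0.0
    for i in range(2 * n):
        j = (i + n) % (2 * n)  # index of the positive partner of i
        pos = cosine(z[i], z[j]) / temperature
        # The denominator covers every other embedding, positive included.
        denom = sum(
            math.exp(cosine(z[i], z[k]) / temperature)
            for k in range(2 * n)
            if k != i
        )
        total += math.log(denom) - pos
    return total / (2 * n)


# Toy embeddings: two molecules, two views each (illustrative values only).
view_a = [[1.0, 0.0], [0.0, 1.0]]
aligned = nt_xent(view_a, [[1.0, 0.0], [0.0, 1.0]])
mismatched = nt_xent(view_a, [[0.0, 1.0], [1.0, 0.0]])
print(aligned < mismatched)
```

In this toy setting the loss is lower when the two views of each molecule share an embedding than when the views are swapped between molecules, which is the behaviour the contrastive objective is meant to reward.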

Original bundle

Name: CSE 25-134 SH JL.pdf
Size: 8.36 MB
Format: Adobe Portable Document Format

Collections

Examensarbeten för masterexamen (Master's theses)