Active Learning for Surrogate Models to Augment AI-Driven Molecular Design
Master's thesis
This project investigated whether an active learning (AL) framework can reduce the computational cost of AI-driven molecular design without negatively impacting accuracy. Two surrogate models, Random Forest (RF) and Support Vector Regression (SVR), were tested together with the acquisition functions (AFs) Random, Thompson Sampling (TS), Tanimoto Similarity, Expected Improvement (EI), Probability of Improvement (PI), Upper Confidence Bound (UCB) and ε-Greedy. Of these, the combination of RF and Random acquisition was concluded to perform best with regard to both error, measured as root mean square error (RMSE), and time consumption, measured as runtime per epoch. SVR had a slightly lower error but took substantially longer to run. Depending on the choice of AF, one run using RF took approximately 2 to 17.5 hours, while one run using SVR took approximately 100 to 175 hours. Four tuning parameters were introduced to see whether they could further optimize the framework. A longer retrain interval and a smaller acquisition batch were found to shorten the runtime without significantly affecting accuracy. In summary, an RF model with the Random AF, a 5-epoch initial pooling, no warm-up phase, a retrain interval of 20 and an acquisition batch size of 20 was selected to reduce computational cost while keeping the error stable.
active learning, Bayesian optimization, de novo design, molecular design, drug discovery, surrogate model, machine learning, molecular docking
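
Several of the acquisition functions named in the abstract have standard textbook forms, which the following minimal sketch illustrates. It is not the thesis implementation. It assumes that the surrogate returns a predicted mean `mu` and an uncertainty estimate `sigma` for each candidate molecule, and that higher predicted values are better (for a score where lower is better, the sign would be flipped). The function names, the `beta` parameter, and the per-batch form of ε-greedy are assumptions made for illustration only.

```python
import math
import random


def _pdf(z):
    """Standard normal probability density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def _cdf(z):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def probability_of_improvement(mu, sigma, best):
    """PI: probability that a candidate exceeds the best observed value."""
    if sigma <= 0.0:
        return 1.0 if mu > best else 0.0
    return _cdf((mu - best) / sigma)


def expected_improvement(mu, sigma, best):
    """EI: expected amount by which a candidate exceeds the best value."""
    if sigma <= 0.0:
        return max(mu - best, 0.0)
    z = (mu - best) / sigma
    return (mu - best) * _cdf(z) + sigma * _pdf(z)


def upper_confidence_bound(mu, sigma, beta=2.0):
    """UCB: predicted mean plus beta times the uncertainty (beta is assumed)."""
    return mu + beta * sigma


def select_batch(scores, batch_size):
    """Indices of the batch_size highest-scoring candidates."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:batch_size]


def select_random(n_candidates, batch_size, rng):
    """Random acquisition: a uniform sample without replacement."""
    return rng.sample(range(n_candidates), min(batch_size, n_candidates))


def select_epsilon_greedy(mus, batch_size, epsilon, rng):
    """One assumed variant: with probability epsilon explore at random,
    otherwise take the candidates with the highest predicted mean."""
    if rng.random() < epsilon:
        return select_random(len(mus), batch_size, rng)
    return select_batch(mus, batch_size)


# Toy usage with made-up predictions for four candidates.
mus = [0.2, 0.9, 0.5, 0.7]
sigmas = [0.3, 0.05, 0.4, 0.0]
best = 0.6
ei = [expected_improvement(m, s, best) for m, s in zip(mus, sigmas)]
top_by_ei = select_batch(ei, 2)
random_batch = select_random(100, 20, random.Random(0))
```

Random acquisition needs no surrogate uncertainty and no scoring pass over the candidate pool, which is consistent with it being the cheapest option per epoch in the reported runtimes.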