Active Learning for Surrogate Models to Augment AI-Driven Molecular Design
Type
Master's thesis
Published
2022
Authors
JOSEFSON, CHRISTIAN
NYMAN, CLARA
Abstract
This project investigated whether an active learning (AL) framework can help mitigate
computational costs for AI-driven molecular design, without negatively impacting
accuracy. The surrogate models Random Forest (RF) and Support Vector
Regression (SVR) were tested together with the acquisition functions (AF) Random,
Thompson Sampling (TS), Tanimoto Similarity, Expected Improvement (EI), Probability
of Improvement (PI), Upper Confidence Bound (UCB) and ε-greedy. Of
these, the combination of RF and Random acquisition was concluded to offer the
best trade-off between error, measured as root mean square error, and time consumption,
measured in runtime per epoch. SVR had a slightly lower error but took
substantially longer. Depending on the choice of AF, one run using RF took
approximately 2-17.5 hours, while one run using SVR took approximately 100-175
hours. Four tuning parameters were introduced to see if they could further optimize
the framework. It was found that a longer retrain interval and a smaller
acquisition batch shortened the runtime without significantly affecting
accuracy. To summarise, an RF model with the Random AF, a 5-epoch
initial pooling, no warm-up phase, a retrain interval of 20 and an acquisition batch
size of 20 was selected to reduce computational costs while keeping
the error stable.
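As an illustration of the acquisition functions named in the abstract, the following is a minimal sketch of EI, PI, UCB and ε-greedy in their standard textbook forms, not the thesis implementation. It assumes a maximisation convention and that the surrogate supplies a predictive mean `mu` and an uncertainty `sigma` for each candidate molecule (for an RF this could be, for example, the spread of per-tree predictions). The function names and the parameters `xi`, `beta` and `epsilon` are hypothetical.

```python
import math
import random


def _norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def _norm_cdf(z):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mu, sigma, best, xi=0.0):
    """EI over the best observed value (maximisation assumed)."""
    gain = mu - best - xi
    if sigma <= 0.0:
        return max(gain, 0.0)
    z = gain / sigma
    return gain * _norm_cdf(z) + sigma * _norm_pdf(z)


def probability_of_improvement(mu, sigma, best, xi=0.0):
    """Probability that a candidate beats the best observed value."""
    gain = mu - best - xi
    if sigma <= 0.0:
        return 1.0 if gain > 0.0 else 0.0
    return _norm_cdf(gain / sigma)


def upper_confidence_bound(mu, sigma, beta=1.0):
    """Optimistic estimate: mean plus a multiple of the uncertainty."""
    return mu + beta * sigma


def select_batch(scores, batch_size):
    """Indices of the batch_size highest-scoring candidates."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:batch_size]


def epsilon_greedy_batch(mu, batch_size, epsilon, rng):
    """Fill each slot with a random candidate with probability epsilon,
    otherwise with the remaining candidate of highest predicted mean."""
    remaining = list(range(len(mu)))
    chosen = []
    while remaining and len(chosen) < batch_size:
        if rng.random() < epsilon:
            pick = rng.choice(remaining)
        else:
            pick = max(remaining, key=lambda i: mu[i])
        remaining.remove(pick)
        chosen.append(pick)
    return chosen


if __name__ == "__main__":
    mu = [0.2, 0.8, 0.5, 0.9]
    sigma = [0.3, 0.1, 0.4, 0.05]
    best = 0.7
    ei = [expected_improvement(m, s, best) for m, s in zip(mu, sigma)]
    print(select_batch(ei, batch_size=2))
    print(epsilon_greedy_batch(mu, 2, epsilon=0.1, rng=random.Random(0)))
```

The Random AF corresponds to drawing the batch uniformly from the pool, which is what `epsilon_greedy_batch` reduces to when `epsilon` is 1. Because it needs no uncertainty estimate or scoring pass, it is the cheapest option per acquisition step.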
Subject/keywords
active learning, Bayesian optimization, de novo design, molecular design, drug discovery, surrogate model, machine learning, molecular docking