Using Transformers for Chemical Toxicity Prediction
Type
Master's Thesis (Examensarbete för masterexamen)
Abstract
Pollution from toxic chemicals threatens both biodiversity and human health, resulting in significant costs for society. To mitigate these impacts, chemical emission regulations are employed, typically establishing environmentally safe concentrations based on toxicity endpoints such as the half-maximal effective concentration (EC50), derived from in vivo experiments. Such experiments are usually time-consuming, expensive, and sometimes ethically problematic to conduct. As alternative means of predicting chemical toxicity, previous studies have proposed computational methods (e.g., QSAR) and machine learning approaches, including transformer-based models. In one such study, a pre-trained transformer-based model was employed to predict the EC50 values of chemicals for fish. The chemical structures, represented in SMILES notation, served as input to a model consisting of a RoBERTa component followed by a fully connected feed-forward neural network.

The present master's thesis builds upon that study, using the same dataset and model framework. It aims to compare the toxicity prediction performance of fine-tuned-only models under different architecture-related model hyperparameters and to analyze the influence of these hyperparameters. In addition, it evaluates the impact of pre-training by comparing these models to a pre-trained and fine-tuned ChemBERTa model. The effect of model architecture was examined only for the RoBERTa component, by varying three hyperparameters: embedding size, number of encoder layers, and number of attention heads. The results indicated that increasing the embedding size and the number of encoder layers improved prediction performance, whereas no clear pattern was observed for the number of attention heads. Additionally, pre-training appeared to be necessary, since the ChemBERTa-based model outperformed all non-pretrained models. These findings contribute to the development of transformer-based machine learning models for chemical toxicity prediction by indicating promising directions for model architecture and pre-training. Future research may evaluate whether these findings hold for larger hyperparameter values, as well as for other chemical representations, toxicity endpoints, and species beyond EC50 and fish.
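The model family studied in the thesis can be illustrated with a minimal sketch: a Transformer encoder over tokenized SMILES followed by a feed-forward regression head that outputs a single EC50 value. This is an assumption-laden simplification, not the thesis's actual implementation: it uses PyTorch's generic `nn.TransformerEncoder` in place of a RoBERTa component, mean pooling in place of a [CLS]-style pooled output, and made-up values for vocabulary size and sequence length. Its purpose is only to show where the three varied hyperparameters (embedding size, number of encoder layers, number of attention heads) enter the architecture.

```python
import torch
import torch.nn as nn

class SmilesToxicityRegressor(nn.Module):
    """Hypothetical sketch: Transformer encoder over tokenized SMILES,
    plus a fully connected head regressing one toxicity value (e.g. log EC50).
    The three hyperparameters varied in the thesis appear as constructor
    arguments. Note: embed_dim must be divisible by n_heads."""

    def __init__(self, vocab_size=600, embed_dim=128, n_layers=4,
                 n_heads=4, max_len=256):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, embed_dim)   # embedding size
        self.pos_emb = nn.Embedding(max_len, embed_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim,
            nhead=n_heads,                                     # attention heads
            dim_feedforward=4 * embed_dim,
            batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)  # encoder layers
        self.head = nn.Sequential(                             # feed-forward head
            nn.Linear(embed_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, 1))

    def forward(self, token_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.token_emb(token_ids) + self.pos_emb(positions)
        x = self.encoder(x)
        # Mean-pool over the sequence (a simplification of RoBERTa's pooling),
        # then regress a single scalar per molecule.
        return self.head(x.mean(dim=1)).squeeze(-1)

# One prediction per molecule in a batch of 8 dummy token sequences of length 40.
model = SmilesToxicityRegressor(embed_dim=64, n_layers=2, n_heads=2)
dummy_batch = torch.randint(0, 600, (8, 40))
preds = model(dummy_batch)
print(preds.shape)  # torch.Size([8])
```

Varying `embed_dim`, `n_layers`, and `n_heads` across such constructor calls corresponds to the architecture sweep described above; a pre-trained counterpart would instead load ChemBERTa weights before fine-tuning the whole stack on the EC50 data.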
Subject/keywords
toxicity prediction, chemical toxicity, EC50, transformers, ChemBERTa, SMILES, pre-training, machine learning, deep learning, neural networks
