Using Transformers for Chemical Toxicity Prediction

Type

Master's thesis (Examensarbete för masterexamen)

Abstract

Pollution from toxic chemicals threatens both biodiversity and human health, resulting in significant costs for society. To mitigate these impacts, chemical emissions are regulated on the basis of toxicity measures such as the EC50 (half-maximal effective concentration). Such regulations typically establish environmentally safe chemical concentrations using data from in vivo experiments, which are time-consuming, expensive, and sometimes ethically problematic to conduct. As alternative means of predicting chemical toxicity, previous studies have proposed computational methods (e.g., QSAR) and machine learning approaches, including transformer-based models. One such study employed a pre-trained transformer-based model to predict the EC50 values of chemicals for fish. Chemical structures, represented in SMILES notation, served as input to a model consisting of a RoBERTa component followed by a fully connected feed-forward neural network. The present master's thesis builds on that study, using the same dataset and model framework. It compares the toxicity prediction performance of fine-tuned-only models across architectural hyperparameters and analyzes the influence of those hyperparameters. In addition, it evaluates the impact of pre-training by comparing these models to a pre-trained and fine-tuned ChemBERTa model. The effect of model architecture was examined only for the RoBERTa component by varying three hyperparameters: embedding size, number of encoder layers, and number of attention heads. The results indicate that increasing the embedding size and the number of encoder layers improves prediction performance, whereas no clear pattern emerged for the number of attention heads. Pre-training also appears to be necessary, since the ChemBERTa-based model outperformed all non-pretrained models.
These findings contribute to the development of transformer-based machine learning models for chemical toxicity prediction by indicating promising directions for model architecture and pre-training. Future research could evaluate whether these findings hold for larger hyperparameter values, as well as for other chemical representations, toxicity endpoints, and species beyond EC50 and fish.
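The architecture described in the abstract (a RoBERTa encoder over tokenized SMILES, followed by a fully connected feed-forward regression head) can be sketched roughly as follows. This is a minimal illustration, not the thesis's actual implementation: the class name, vocabulary size, and all hyperparameter values are illustrative assumptions, and the three hyperparameters varied in the thesis are marked in the comments.

```python
import torch
import torch.nn as nn
from transformers import RobertaConfig, RobertaModel

# Hypothetical sketch of the model family studied in the thesis:
# a RoBERTa encoder over tokenized SMILES strings, followed by a
# feed-forward head that regresses a single EC50-related value.
class SmilesToxicityRegressor(nn.Module):
    def __init__(self, vocab_size=600, embedding_size=256,
                 num_layers=6, num_heads=8):
        super().__init__()
        config = RobertaConfig(
            vocab_size=vocab_size,
            hidden_size=embedding_size,        # varied in the thesis
            num_hidden_layers=num_layers,      # varied in the thesis
            num_attention_heads=num_heads,     # varied in the thesis
            intermediate_size=4 * embedding_size,
        )
        self.encoder = RobertaModel(config)
        self.head = nn.Sequential(             # fully connected head
            nn.Linear(embedding_size, embedding_size),
            nn.ReLU(),
            nn.Linear(embedding_size, 1),      # single toxicity value
        )

    def forward(self, input_ids, attention_mask=None):
        out = self.encoder(input_ids=input_ids,
                           attention_mask=attention_mask)
        first_token = out.last_hidden_state[:, 0]  # sequence summary
        return self.head(first_token).squeeze(-1)

model = SmilesToxicityRegressor()
tokens = torch.randint(0, 600, (2, 32))  # fake tokenized SMILES batch
preds = model(tokens)
print(preds.shape)  # → torch.Size([2])
```

A fine-tuned-only model would train this from random initialization, while the ChemBERTa comparison would instead load pre-trained encoder weights before fine-tuning.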

Subject / keywords

toxicity prediction, chemical toxicity, EC50, transformers, ChemBERTa, SMILES, pre-training, machine learning, deep learning, neural networks
