AI-Based Toxicity Prediction
as an Alternative to Animal Testing
A Transformer-Based Deep Learning Approach to Toxicity Prediction

Master’s thesis in Engineering Mathematics and Computational Science

Mercedes Dalman

DEPARTMENT OF MATHEMATICAL SCIENCES

CHALMERS UNIVERSITY OF TECHNOLOGY
Gothenburg, Sweden 2023
www.chalmers.se



Master’s thesis 2023

AI-Based Toxicity Prediction
as an Alternative to Animal Testing

A Transformer-Based Deep Learning Approach to Toxicity
Prediction

MERCEDES DALMAN

Department of Mathematical Sciences
Systems Biology and Bioinformatics

Chalmers University of Technology
Gothenburg, Sweden 2023



AI-Based Toxicity Prediction as an Alternative to Animal Testing
A Transformer-Based Deep Learning Approach to Toxicity Prediction
MERCEDES DALMAN

© MERCEDES DALMAN, 2023.

Supervisors:
Erik Kristiansson, Department of Mathematical Sciences
Mikael Gustavsson, Department of Mathematical Sciences
Examiner:
Erik Kristiansson, Department of Mathematical Sciences

Master’s Thesis 2023
Department of Mathematical Sciences
Chalmers University of Technology
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Cover: Self-attention visualised on a poem of a rat and a mouse written by OpenAI
ChatGPT [1].

Typeset in LaTeX, template by Kyriaki Antoniadou-Plytaria
Printed by Chalmers Reproservice
Gothenburg, Sweden 2023



AI-Based Toxicity Prediction as an Alternative to Animal Testing
A Transformer-Based Deep Learning Approach to Toxicity Prediction
MERCEDES DALMAN
Department of Mathematical Sciences
Chalmers University of Technology

Abstract
In recent years, there has been a significant increase in the use of chemicals in our
environment due to growing demand and consumption. Consequently, large-scale
chemical regulation based on toxicological assays has been implemented to prevent
exposure-related consequences for nature and human health. Historically, animal-
based assays have been used for this purpose. However, there is now an increasing
demand to replace these animal-based assessment methods with computer-based
alternatives. Despite previous attempts to develop computer-based models, these
models have proven to be unreliable and inaccurate, leading to a decrease in inter-
est. Therefore, there is a pressing need to develop new computer-based models for
toxicity assessment. Here, the introduction of deep learning models, particularly
the transformer architecture, has the potential to revolutionise the field. Deep neural
networks have demonstrated the ability to handle complex and high-dimensional
problems, surpassing older modelling techniques. Moreover, as the transformer has
shown promise in handling chemical structure information, there is growing interest
in its usage in the field of environmental toxicity assessment. The aim of this project
was hence to explore the potential of transformer-based deep neural network models
for the purpose of toxicity assessment.

For this project, a subset of rat and mouse in vivo toxicity assay data associated
with EC50 and LOEC measurements, as well as different administration routes,
was utilised. Three sets of data were analysed, each distinguished by its hazard:
acute toxicity, carcinogenicity, or reproductive toxicity. The first type of model,
the single-DNN model, was created for each data set separately. Subsequently,
these models were expanded to the multiple-DNN model, able to handle all three
data sets simultaneously. For all models, a pre-trained RoBERTa transformer was
utilised to interpret canonicalised SMILES representations of chemical structures,
with the performance then evaluated through repeated 10-fold cross-validation.
Principal Component Analysis demonstrated that the transformer could identify
patterns in chemical structures related to toxicity. Moreover, the study found that
the single-DNN model outperformed the multiple-DNN model in all trials, likely
due to the latter’s increased complexity. All models exhibited leniency towards
chemicals with low measured concentrations, and to mitigate this problem, a more
stringent loss for lower concentrations was suggested. Overall, this project
demonstrated the potential and effectiveness of transformer-based computer models
for toxicity assessment, showcasing the versatility of this technology for addressing
a broad range of toxic hazards.

Keywords: environmental risk assessment, SMILES, RoBERTa, deep learning,
artificial intelligence, transformer, toxicity



Acknowledgements
This work would never have been possible without the help and support of the
brilliant people around me. Here, I would first like to thank my main supervisor
and examiner Erik Kristiansson, who has been a big contributor to the ideas and
knowledge that have guided me along this journey. Moreover, I would like to thank
my co-supervisor Mikael Gustavsson, who with his expertise in the world of
ecotoxicology has helped provide the basis on which this work stands. Furthermore,
I would especially like to thank Styrbjörn Käll for his help and engagement in this
project, and for answering all my questions. Finally, I would like to thank my friends
and family for their patience and support; without their love, it is safe to say that
this work would never have seen the light of day.

Mercedes Dalman, Gothenburg, May 2023



Contents

List of Figures

List of Tables

1 Introduction
1.1 Aims and Scope

2 Theory
2.1 Environmental Toxicology
2.2 Deep Neural Networks
2.3 Simplified Molecular Input Line Entry System (SMILES)
2.4 Natural Language Processing and The Transformer Architecture
2.4.1 BERT-Based Transformers

3 Methods
3.1 In Vivo Toxicological Data
3.1.1 Pre-processing
3.2 Architecture
3.3 Training
3.3.1 K-Fold Cross-Validation
3.3.2 Median Loss, Best Average Loss and LossMEAN

4 Results
4.1 Architecture
4.2 Model Performance: 10-Fold Cross-Validation
4.3 Acute Toxicity Data Set
4.3.1 Model Performance
4.3.2 Model Results
4.4 Carcinogenicity Data Set
4.4.1 Model Performance
4.4.2 Model Results
4.5 Reprotoxicity Data Set
4.5.1 Model Performance
4.5.2 Model Results

5 Discussion
5.1 Model Performance
5.2 Result Analysis

6 Conclusion and Future Work

Bibliography

A Appendix 1
A.1 PCA for Carcinogenicity and Reproductive Toxicity Data Sets



List of Figures

2.1 A general example of a dose-response curve, in which values for LOEC
(Lowest Observed Effect Concentration) and EC50 (50% effect concentration)
have been marked out.

2.2 A simple illustration of a Deep Neural Network, consisting of only
one input layer with two nodes, as well as an output layer with one
node parameterised by one weight for each input node (w1 and w2),
and a bias, b.

2.3 An example of a molecular structure together with its SMILES
representation. Each part of the original structure corresponding to a
certain part of the SMILES string has here been assigned the same
colour, as well as geometrically close pairs in the SMILES string.

2.4 A simplified representation of a tokeniser connected to a RoBERTa
transformer. The input string is first segmented into predetermined
tokens (TN) by the tokeniser, before being fed to the transformer.
Then, each token is assigned an input and positional embedding
(EN). Using self-attention, the transformer adjusts these embeddings
based on the problem at hand. During this process, the CLS token is
trained, which then can be extracted as output from the transformer.

3.1 Relative amounts of the three different data sets in in vivo animal
test data.

3.2 Venn diagram of unique SMILES, as well as their overlap, in and
between in vivo animal test data sets.

3.3 Administration route distribution in in vivo animal test data.

3.4 A simplified overview of the single-DNN model. The model consists
of two parts: a ChemBERTa transformer and a Deep Neural Network
(DNN). Here, the transformer receives the SMILES strings (textual
representation) of chemicals’ molecular structures as input and
produces a numerical representation (CLS-token) that is then fed to the
DNN. Then, the DNN utilises the transformer output, along with
additional one-hot encoded metadata and measured log10 concentrations,
to predict log10 concentrations.

3.5 Structure of the second model used in the project. The ChemBERTa
transformer is identical to what has previously been described for the
first model, and each of the three networks connected to this
transformer also has the same structure as the single models for each data
set. Hence, the real difference between this model and the previous
lies in the fact that the three data sets can in this case be analysed
simultaneously through their own respective Deep Neural Networks
(DNN). Moreover, these networks all work independently from one
another, with the total loss (used for training) being the sum of the
losses from each network.

3.6 Illustration of K-fold cross-validation.

4.1 Training loss average for a) single- and b) multiple-DNN models when
analysing the acute toxicity data set.

4.2 Validation best average loss, yellow, median loss, blue, and lossMEAN,
green, average over 10 folds for the acute toxicity data set, with values
of single-DNN model to the left in each case, and multiple-DNN model
to the right.

4.3 Median validation residuals and predictions versus measured
concentrations for each SMILES in 10-fold cross-validation for acute toxicity
data, with the left image corresponding to the single- and the right
to the multiple-DNN model in both cases.

4.4 Measured vs predicted Log10[mg/kg bw] concentrations for a) the top
five worst- and b) the top five best-performing chemicals, with the left
plot corresponding to the single- and the right to the multiple-DNN
model.

4.5 Principal component analysis of CLS tokens from one fold in 10-fold
cross-validation of acute toxicity data set, with the left plot
corresponding to the single- and the right to the multiple-DNN model.

4.6 Training loss average for a) single- and b) multiple-DNN models when
analysing the carcinogenicity data set.

4.7 Validation best average loss, yellow, median loss, blue, and lossMEAN,
green, average with standard deviation over 10 folds for the
carcinogenicity data set, with values of single-DNN model to the left in each
case, and multiple-DNN model to the right.

4.8 Median validation residuals and predictions versus measured
concentrations for each SMILES in 10-fold cross-validation for carcinogenicity
data, with the left image corresponding to the single- and the right
to the multiple-DNN model in both cases.

4.9 Average training loss over each fold in 10-fold cross-validation for in
vivo reproductive toxicity data in a) the single- and b) the multiple-DNN
model.

4.10 Validation best average loss, yellow, median loss, blue, and lossMEAN,
green, average with standard deviation over 10 folds for the reproductive
toxicity data set, with values of single-DNN model to the left in
each case, and multiple-DNN model to the right.

4.11 Median validation residuals and predictions versus measured
concentrations for each SMILES in 10-fold cross-validation for reproductive
toxicity data, with the left image corresponding to the single- and the
right to the multiple-DNN model in both cases.

A.1 Principal component analysis of CLS tokens from one fold in 10-fold
cross-validation of a) the carcinogenicity data set, and b) the
reproductive toxicity set, coloured by corresponding median concentration
for each CLS, with the left plot corresponding to the single-
and the right to the multiple-DNN model in each case.


List of Tables

3.1 Variables in In Vivo Data with Categories
3.2 Distribution of Assay Outcomes and Species in In Vivo Data

4.1 Parameter Sweep Configuration Summary
4.2 Final Parameter Settings
4.3 Loss Comparison Between Analysed Data Sets in Single Model
4.4 Top Five Chemicals With Largest Median Residuals in Acute Toxicity Data Set
4.5 Top Five Chemicals With Smallest Median Residuals in Acute Toxicity Data Set


1
Introduction

The swift progress of industrial and societal advancements has led to an escalated
threat of chemical exposure to both nature and human health [2]. To mitigate this
risk, chemicals are now mandated to undergo rigorous testing for potential haz-
ardous effects, such as toxicity, prior to being introduced to the market [3], [4]. In
relation to this, the European Union’s Registration, Evaluation, Authorisation, and
Restriction of Chemicals (REACH) regulation requires companies to register chem-
icals and provide associated risk assessments, which include information on toxicity
and ecotoxicity [5]. Here, the foundation of these risk assessments lies within the
Environmental Risk Assessment (ERA), which predominantly encompasses toxico-
logical assays.

Toxicological assays are conventionally conducted through a variety of both in vivo
and in vitro animal experiments [6]. However, these tests have been found to be
both time- and cost-intensive, in addition to raising ethical concerns and being of
questionable reliability. In response, there has been a growing interest in developing
alternative testing methods, with the EU advocating for the use of in silico models.
However, despite this, there has been a decline in the number of computational mod-
els used in recent years, with only a small number currently in use [4]. Moreover,
the primary cause of this reduced interest in computational models is the substan-
tial variations in output and performance due to differences in chemical structural
information handling [7].

Consequently, computational methods require further research and development to
enhance their accuracy and reliability before they can completely supplant biologi-
cal tests. Here, Artificial Intelligence (AI) has emerged as a promising prospect [3].
With its cost-effectiveness and ability to process vast amounts of intricate data, AI
has outperformed traditional modelling methods and transformed the computational
realm. Nowadays, AI is applied across various scientific fields, including biology and
medicine [8], [3]. Moreover, the introduction of the revolutionising transformer ar-
chitecture offers a potential solution to the challenge of handling chemical structural
information [9]. Against this backdrop, this thesis aims to expand on previous en-
deavours to advance the exploration and development of AI-based models for toxicity
evaluation.


1.1 Aims and Scope
The objective of this study is to create and evaluate AI-based models that can forecast
mammalian chemical toxicity. Here, the data set used will include rat and mouse
toxicity assays for a diverse range of administration routes but be restricted to only
EC50 and LOEC measurements. Initially, three distinct models will be developed,
each exclusively focused on predicting one of the following toxic effects: acute tox-
icity, carcinogenicity, and reproductive toxicity. Subsequently, the goal is to merge
these models into a single model that can manage all three toxic effects. More-
over, the models will be founded on a combination of transformer architecture and
a Deep Neural Network (DNN). In this system, the transformer component will be
responsible for processing and transforming chemical structural inputs in the form
of SMILES. Furthermore, the DNN component will utilise the transformer’s output,
together with additional metadata, to perform the final toxicological assessment. In
summary, the research aims addressed by this study are as follows:

• Develop and analyse a transformer-based model able to perform toxicological
predictions for one specific toxic effect.

• Develop and analyse an expanded model able to handle and predict various
toxic effects simultaneously.

This study had access to a substantial amount of mammalian data associated with
different species and outcomes. However, due to time and feasibility constraints, the
investigation was restricted to the categories: rats, mice, EC50, LOEC and a few
select administration routes. In addition, for the same reasons, only one type of
transformer, the RoBERTa transformer, was evaluated and integrated with a basic
feed-forward neural network.



2

Theory

This chapter aims to explain the theory necessary to comprehend the crucial
procedures in the project, with specific details provided as necessary. Additional
sources are cited for further information, and the methods for model building and
training procedures used in the project will be covered in the next chapter.

2.1 Environmental Toxicology
The scientific field of environmental toxicology covers both the analysis of potential
health risks associated with hazardous chemicals and the corresponding management
and protection measures [10], [11], [12]. Toxic chemicals can cause acute or chronic
(for example cancer-related) health effects, with dosage being a significant factor in
the outcome of an exposure. Laboratory toxicity assessments, made through a
combination of in vivo, in vitro, and in silico testing procedures, are used to
determine hazardous properties and dosage for different chemicals. Results are often
presented as effect concentrations: ECx is the concentration at which an x% effect is
observed (for example EC50 for a 50% effect), while the Lowest Observed Effect
Concentration (LOEC) is the lowest tested concentration at which an effect is
observed. A dose-response curve is commonly used to visualize these results; a
general example, in which the EC50 and LOEC values have been marked out, is shown
in Figure 2.1.


Figure 2.1: A general example of a dose-response curve, in which values for LOEC
(Lowest Observed Effect Concentration) and EC50 (50% effect concentration) have
been marked out.
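
As a rough illustration of how an EC50 relates to a dose-response curve, the sketch below assumes a Hill-type (log-logistic) curve. This functional form is a common choice but is an assumption here, not something the thesis specifies, and the function names and parameter values are hypothetical. The LOEC is not modelled, since it depends on the concentrations actually tested.

```python
import math

def hill_response(conc, ec50, slope, top=1.0):
    """Fraction of the maximal effect at a concentration (assumed Hill curve)."""
    if conc <= 0:
        return 0.0
    return top / (1.0 + (ec50 / conc) ** slope)

def effect_concentration(x, ec50, slope):
    """Concentration giving a fraction x (0 < x < 1) of the maximal effect.

    Obtained by inverting the Hill curve above for top = 1.
    """
    return ec50 * (x / (1.0 - x)) ** (1.0 / slope)

# Hypothetical chemical: EC50 = 10 mg/kg bw, Hill slope = 2
ec50, slope = 10.0, 2.0
half = hill_response(ec50, ec50, slope)          # effect at the EC50
ec10 = effect_concentration(0.10, ec50, slope)   # concentration for a 10% effect
print(f"effect at EC50: {half:.2f}, EC10: {ec10:.2f} mg/kg bw")
```

By construction, the response at the EC50 is half of the maximal effect, and any other ECx follows from the same curve once the slope is known.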

Within the EU, the REACH Regulation requires companies to conduct chemical
toxicity assessments before producing, importing or selling a chemical [13]. The Eu-
ropean Chemicals Agency (ECHA) is responsible for registering and assessing the
risks associated with the chemicals, as well as determining the need for restrictions
or bans. Moreover, the regulation obligates companies to identify and manage po-
tential hazards associated with the chemicals they market or produce by providing
guidelines for data collection and toxicity assessment.

2.2 Deep Neural Networks
Deep Learning, based on Artificial Neural Networks (ANNs), has been shown to
be a useful tool for chemical toxicity analysis [3], [14], [15]. ANNs, often in the
form of Deep Neural Networks (DNNs), consist of multiple layers of interconnected
nodes, forming a complex web of connections. Here, the connection between neu-
rons is parameterised by weights and biases, making it essentially a weighted sum.
Additionally, an activation function at each layer re-scales the signal and thereby
determines to what degree the signal should be passed on to the next layer. The opti-
misation of this system, achieved through supervised learning with gradient descent,
involves updating the weights and biases through backpropagation to minimise the
error between the predicted outcomes and the measured values found in the data.
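
The weighted sum, activation, and gradient-descent update described above can be sketched for a single neuron with two inputs. This is a minimal illustration only: the sigmoid activation, the squared-error loss, the learning rate, and all names are assumptions chosen for brevity, not the settings used in this project.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w, b):
    """Weighted sum of the inputs plus a bias, passed through the activation."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)

def gradient_step(x, target, w, b, lr=0.5):
    """One gradient-descent update for the loss 0.5 * (y - target)^2."""
    y = forward(x, w, b)
    # Chain rule (backpropagation for one neuron): dL/dz = (y - t) * y * (1 - y)
    delta = (y - target) * y * (1.0 - y)
    new_w = [wi - lr * delta * xi for wi, xi in zip(w, x)]
    new_b = b - lr * delta
    return new_w, new_b

# Hypothetical training example and initial parameters
x, target = [1.0, 2.0], 1.0
w, b = [0.1, -0.2], 0.0
loss_before = 0.5 * (forward(x, w, b) - target) ** 2
for _ in range(100):
    w, b = gradient_step(x, target, w, b)
loss_after = 0.5 * (forward(x, w, b) - target) ** 2
print(f"loss before: {loss_before:.4f}, loss after: {loss_after:.4f}")
```

In a full DNN the same chain-rule step is applied layer by layer, from the output back to the input.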

DNNs can handle large and complex data better than traditional regression tech-
niques. However, making a DNN too large can lead to it becoming too fine-tuned to
the training data, so-called overfitting, and poor performance on unseen data [16].
Hence, the performance of a DNN depends heavily on hyperparameters like the
number of hidden layers and neurons in each layer. To address this, dropout, where
a percentage of neurons in the model are inactivated, and freezing some layers dur-
ing training can reduce the model’s sensitivity to training data. An example of a
simple DNN with only an input and output layer can be seen in Figure 2.2.
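
A minimal sketch of dropout, assuming the widely used "inverted" variant in which the surviving activations are rescaled so that their expected value is unchanged. The function name, the rate, and the seed are hypothetical; layer freezing is not shown.

```python
import random

def dropout(activations, rate, training=True, rng=random):
    """Zero each activation with probability `rate` and rescale the survivors."""
    if not training or rate == 0.0:
        # At evaluation time every neuron stays active.
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)            # fixed seed so the sketch is reproducible
hidden = [1.0] * 1000             # hypothetical hidden-layer activations
dropped = dropout(hidden, rate=0.2, rng=rng)
print(sum(1 for v in dropped if v == 0.0), "of", len(dropped), "neurons inactivated")
```

Because a different random subset of neurons is inactivated in every training step, the network cannot rely too heavily on any single neuron.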

Figure 2.2: A simple illustration of a Deep Neural Network, consisting of only one
input layer with two nodes, as well as an output layer with one node parameterised
by one weight for each input node (w1 and w2), and a bias, b.

2.3 Simplified Molecular Input Line Entry System (SMILES)

Chemicals’ properties and their potential health hazards are related to their molec-
ular structures, making efficient utilization of structural information critical for in
silico methods for toxicity prediction [9]. Moreover, to process molecular structures
computationally, they must be represented in a 1D sequential format while retaining
important structural information found in their 3D form. Here, Simplified Molecular
Input Line Entry System (SMILES) strings, sequences of letters and symbols that
each represent a molecular structure, are commonly used for small structures, as they
are designed to retain information on the 3D aspects of the
original molecular structure they represent. Figure 2.3 provides an example of a
structure with its corresponding SMILES representation [17]. In the figure, colour
has been used to demonstrate which parts of the original 3D structure correspond
to a certain element in the SMILES string, as well as geometrically close pairs in
the structure.


Figure 2.3: An example of a molecular structure together with its SMILES rep-
resentation. Each part of the original structure corresponding to a certain part of
the SMILES string has here been assigned the same colour, as well as geometrically
close pairs in the SMILES string.

An issue is that SMILES representations can have multiple versions for the same
chemical, depending on how the atoms have been numbered in the structure [8].
This problem is usually resolved by running SMILES through canonicalisation algo-
rithms, which always follow the same specific rules for generating SMILES. However,
different databases have their own versions of these algorithms, making it impor-
tant to use the same algorithm to ensure the uniqueness of SMILES for a specific
chemical.

2.4 Natural Language Processing and The Transformer Architecture

Computational scientists historically faced an issue with computers’ inability to
process textual inputs like SMILES [9]. To combat this, Natural Language
Processing (NLP) was introduced, a field that the Transformer architecture has
recently revolutionised through technological breakthroughs such as GPT [18].
Transformers rely on the self-attention mechanism to learn and process text into a
high-dimensional numerical output processable in a neural network [9]. Contrary
to older NLP algorithms where text was processed in a sequential manner, lead-
ing to issues both with speed and memory usage, transformers utilise the semantic
meaning found in the geometrical distances between elements in an input string.
This makes them suitable for processing SMILES, where the elements in the string
correspond to actual 3D elements in molecular structures, and the distances between
these elements are important.

More specifically, transformers are composed of encoders and decoders, with the
encoder performing input encoding and the decoder predicting the most probable
translation [19], [20]. Input strings are pre-processed by a tokenizer, which splits
them into smaller elements, or tokens, according to some pre-existing vocabulary,
and each token is then given corresponding input and positional embeddings before
being transformed by the encoder. The task of the encoder is then to shift, or trans-
form, these input-embeddings in a manner which takes context (such as if a certain
word/token is more important for understanding the meaning of the sentence than
others) into consideration. Here, the positional embeddings provide context for the
encoder’s interpretation, as input embeddings lack positional information. The out-
put embeddings, unique for each token, are primed for the decoder’s final translation.

The encoder in the transformer system uses self-attention to take context into con-
sideration [19], [20], [9]. Self-attention computes the relative importance of input
embeddings to one another based on their spatial relationship or distance. For ex-
ample, for an input sentence, some words will be more closely related to each other
than others (e.g. words describing nature, "tree", "river", "soil"). The closeness in
the relationship between these words will translate to their input embeddings lying
closer in high-dimensional space. Furthermore, the self-attention mechanism then
essentially computes the dot product between each input embedding, meaning that
the embeddings which lie close in high dimensional space will be associated with a
high dot product. More specifically, each input embedding is projected into query,
key, and value vectors, each capturing a different aspect of the input token. To
obtain these vectors, each input embedding is multiplied by the respective weight
matrices WQ, WK, and WV. The dot product between one token’s query vector
and another token’s key vector, scaled by the square root of the key dimension dk,
then gives the raw attention score between the two tokens. These scores are passed
through a softmax function, which pushes small and negative values towards 0
(indicating a distant or no relationship between tokens) whilst inflating large values
(indicating a strong relationship between tokens). Finally, the resulting attention
weights are used to form a weighted sum of the value vectors, see Equation 2.1 [9].

\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \tag{2.1}
\]
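
Equation 2.1 can be written out in a few lines of plain Python. The sketch below is for illustration only: the three tokens and their two-dimensional query, key, and value vectors are hypothetical, and a real transformer would use an optimised tensor library rather than nested lists.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

# Three tokens with hypothetical 2-dimensional query, key and value vectors
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
output, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to one, so every output vector is a weighted average of the value vectors, with the largest weights going to the tokens whose keys align best with the query.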

Multi-head self-attention is a technique used to improve the efficiency of the self-
attention mechanism in transformers [19], [20]. It involves splitting the input into
multiple sets of key, query, and value vectors, with each set being associated with its
own weight matrices that capture unique aspects of the input. These multiple heads
work in parallel, each performing the self-attention process described above, and
their outputs are finally concatenated back into vectors of the same size as the original
input embeddings. Moreover, this technique allows transformers to find extremely
complex relationships in their inputs. Additionally, in modern transformers, several
encoders are stacked together in encoder blocks to further enhance their ability to
capture intricate relationships.
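
In the notation of Equation 2.1, the multi-head mechanism described above is commonly written as follows. Here X denotes the matrix of input embeddings, h the number of heads, the superscript (i) the weight matrices of head i, and W_O an output projection applied after concatenation. These symbols follow the usual formulation in the transformer literature and are assumptions rather than notation taken from this thesis.

```latex
\begin{align*}
\mathrm{head}_i &= \mathrm{Attention}\left(X W_Q^{(i)},\; X W_K^{(i)},\; X W_V^{(i)}\right),
  \qquad i = 1, \dots, h \\
\mathrm{MultiHead}(X) &= \mathrm{Concat}\left(\mathrm{head}_1, \dots, \mathrm{head}_h\right) W_O
\end{align*}
```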


2.4.1 BERT-Based Transformers

Low-quality data can significantly affect the output and usefulness of both tradi-
tional and AI-based models. Deep Learning models often require labelled data, but
large datasets in fields such as biology may have limited labelled data, reducing
the performance of traditional transformers [21], [22], [23]. To address this issue,
different classes of transformers have been developed for specific tasks, including
those focused on handling unlabelled data. Here, the Bidirectional Encoder
Representations from Transformers model, or BERT, based on masked-language modelling, has
become popular in text classification and language analysis due to its bidirectional
pre-training and fine-tuning system, allowing it to handle a broad range of problems.

The BERT model’s ability to handle large unlabelled datasets is due to its unsuper-
vised, bidirectional pre-training [21], [22], [23]. Here, BERT uses Masked Language
Modelling (MLM) to give the model a general understanding of the language of inter-
est, through a percentage of tokens being randomly masked, and BERT then being
tasked with predicting the masked word from surrounding words. In pre-training,
Next Sentence Prediction (NSP) is also used to understand the relationship be-
tween sentences in the input data. Once pre-training is complete, BERT is ready
for fine-tuning, where it becomes fine-tuned to a specific task/problem involving the
language at hand. Fine-tuning is done through supervised learning, but BERT’s un-
derstanding of the language from pre-training means that only a small set of labelled
data is required. Through fine-tuning, BERT can perform a wide range of tasks,
such as text prediction, summarization, and text generation. BERT also implements
large-scale parallelization to make use of large amounts of data within an efficient
time frame. When using BERT-based transformers, inputs are tokenized and a CLS
(classification) token is added as the first token. Moreover, the tokens are then trun-
cated or padded to a fixed size and matched to their associated embeddings. The
transformer is then trained through multi-head self-attention, as mentioned previ-
ously, but without using decoders. Specific to BERT is the training of an additional
CLS token that summarizes all the information BERT learns through fine-tuning,
which can be used as a replacement for all the output tokens of a sentence.
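
The input preparation and masking described above can be sketched as follows. This is a simplified illustration: the special-token strings follow the RoBERTa convention, the 15% masking rate is the value commonly used for BERT, every selected token is simply replaced by the mask token, and the function names are hypothetical.

```python
import random

CLS, PAD, MASK = "<s>", "<pad>", "<mask>"   # RoBERTa-style special tokens (assumed)

def prepare(tokens, max_len):
    """Prepend the CLS token, then truncate or pad to exactly max_len tokens."""
    seq = ([CLS] + list(tokens))[:max_len]
    return seq + [PAD] * (max_len - len(seq))

def mask_tokens(seq, rate=0.15, rng=random):
    """Replace a random share of ordinary tokens with MASK.

    Returns the masked sequence and, per position, the original token the
    model should predict (None where nothing was masked).
    """
    masked, labels = [], []
    for tok in seq:
        if tok not in (CLS, PAD) and rng.random() < rate:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels

sequence = prepare(list("CCOCCN"), max_len=10)   # hypothetical tokenised SMILES
masked, labels = mask_tokens(sequence, rate=0.5, rng=random.Random(1))
print(masked)
```

During pre-training the model is scored only on the positions where a label is present; during fine-tuning the output at the CLS position serves as the summary of the whole sequence.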

Since the introduction of BERT, improvements have been made to the model, in-
cluding the development of the Robustly Optimized BERT Approach, or RoBERTa,
to address speed-related issues in pre-training [24]. RoBERTa drops the Next
Sentence Prediction (NSP) objective used in BERT while still outperforming it.
RoBERTa has been applied to various NLP tasks and forms the basis of the
ChemBERTa model, which was pre-trained on large datasets of SMILES canonicalized
with RDKit’s canonicalization algorithm and has a vocabulary of about 8 thousand
tokens [25]. In this project, a pre-trained ChemBERTa
will be used instead of training a new RoBERTa model. Finally, a general
illustration of a RoBERTa transformer, including tokenization and CLS-token ex-
traction, is shown in Figure 2.4.

Figure 2.4: A simplified representation of a tokeniser connected to a RoBERTa
transformer. The input string is first segmented into predetermined tokens (TN) by
the tokeniser, before being fed to the transformer. Then, each token is assigned an
input and positional embedding (EN). Using self-attention, the transformer adjusts
these embeddings based on the problem at hand. During this process, the CLS
token is trained, which then can be extracted as output from the transformer.

3
Methods

This chapter presents the methods used, together with information specific to their
implementation. Its purpose is not to explain the underlying concepts; for this,
the reader is instead directed to the previous chapter.

3.1 In Vivo Toxicological Data
In this project, the in vivo data used, especially that from the RTECS dataset, are
characterized by a large number of associated variables. These variables include
various species, assay outcomes, and administration routes. Hence, to simplify the
analysis, only a limited subset of this data was used. The relevant variables and
categories for this subset are listed in Table 3.1.

Table 3.1: Variables in In Vivo Data with Categories

Variable                Categories
Species                 rat, mouse
Administration Routes   intraperitoneal, oral, intravenous, subcutaneous,
                        dermal, intracerebral, intramuscular, parenteral,
                        intratracheal, intraspinal, implant, other routes
Assay Outcomes          EC50, LOEC
Data Sets               acute toxicity, carcinogenicity, reproductive toxicity

Notably, the "other routes" category of the administration route variable collects
routes of administration that occur only rarely in the in vivo data. The assay data,
in its entirety, also includes concentrations in a variety of units; however, only
those that could be converted to mg/kg body weight were kept during pre-processing.
Moreover, although experiment duration was considered a potentially significant
variable, it was not included due to the lack of available data. All data points in
the in vivo data are associated with a canonicalised SMILES string obtained through
the RDKit canonicalization function, as well as a CAS number. Furthermore, the
distribution of data across the three in vivo toxicological datasets used in this
project varies significantly. The majority of the data, approximately 88%,
corresponds to acute toxicity measurements, whereas reproductive toxicity and
carcinogenicity tests only make up about 9% and 3% of the data, respectively. This
information is visualized in Figure 3.1.

Figure 3.1: Relative amounts of the three different data sets in in vivo animal test
data.

Moreover, the Venn diagram in Figure 3.2, depicting the number of unique SMILES
in each data set, along with the overlap of unique SMILES between data sets, shows
that as well as being the largest row-wise, the acute toxicity data set also dominates
the other data sets in terms of unique SMILES. About 78,000 unique SMILES are
found exclusively in the acute toxicity data set, whilst the other two data sets only
have about 1,600 to 1,700 SMILES uniquely attributed to them.

Figure 3.2: Venn diagram of unique SMILES, as well as their overlap, in and
between in vivo animal test data sets.

The prevalence of certain species, administration routes, and assay outcomes in the
in vivo data used in this project also varies considerably between data sets. The
number of SMILES associated with each assay outcome (EC50 or LOEC) and each species
(mouse or rat) is presented in Table 3.2. The table shows that most of the data are
of type EC50 and from mouse assays. Notably, however, the carcinogenicity and
reproductive toxicity data sets are dominated by LOEC data, with the former
containing only data of this type. It is therefore the acute toxicity data set that
makes EC50 the majority outcome overall. For the species variable, rats dominate in
the reproductive toxicity data set, whilst the distribution of chemicals between
the species is almost equal for the carcinogenicity data. Once again, it is the
acute toxicity data set, and its size, that makes mouse assays the majority overall.

Table 3.2: Distribution of Assay Outcomes and Species in In Vivo Data

Parameter   Acute Toxicity   Carcinogenicity   Reproductive Toxicity     Total
EC50                92,823                 0                      20    92,843
LOEC                20,689             4,257                  11,350    36,296
Mouse               81,143             2,135                   3,252    86,530
Rat                 32,369             2,122                   8,118    42,609

Furthermore, the distribution of administration routes across the entire data is
shown in Figure 3.3, where it can be seen that the intraperitoneal, oral, and
intravenous administration routes dominate, accounting for 34%, 30%, and 18% of the
data, respectively.

Figure 3.3: Administration route distribution in in vivo animal test data.

3.1.1 Pre-processing
For in vivo toxicological assays, all data used in this project come from the RTECS
and REACH databases, and data from both databases were mixed randomly to create the
final data sets. To reduce the variance in concentration, a log10 transformation was
then applied to the measurements. To ensure that the same chemical was always
represented by the same SMILES, all SMILES were run through RDKit's canonicalization
algorithm. Finally, all categorical variables (such as species, administration route
and assay outcome) were one-hot encoded.
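
Two of the pre-processing steps above, the log10 transformation and the one-hot
encoding, can be sketched in plain Python as follows. This is an illustrative
sketch rather than the project's code: the record layout, the field names, the
shortened category lists, and the example record are all made up for the
example, and SMILES canonicalisation is left out because it requires RDKit.

```python
import math
from typing import Any, Dict, List

# Illustrative subsets of the categories in Table 3.1 (assumed ordering).
SPECIES = ["rat", "mouse"]
ROUTES = ["intraperitoneal", "oral", "intravenous"]
OUTCOMES = ["EC50", "LOEC"]


def one_hot(value: str, categories: List[str]) -> List[int]:
    """Return a 0/1 vector with a single 1 at the position of `value`."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]


def encode_record(record: Dict[str, Any]) -> Dict[str, Any]:
    """Log10-transform the concentration and one-hot encode the metadata."""
    conc = float(record["concentration_mg_per_kg"])
    if conc <= 0:
        raise ValueError("concentration must be positive to take log10")
    features = (
        one_hot(record["species"], SPECIES)
        + one_hot(record["route"], ROUTES)
        + one_hot(record["outcome"], OUTCOMES)
    )
    return {
        "smiles": record["smiles"],
        "features": features,
        "log10_conc": math.log10(conc),
    }


# Hypothetical record, not taken from the RTECS or REACH data.
example = encode_record({
    "smiles": "CCO",
    "species": "rat",
    "route": "oral",
    "outcome": "EC50",
    "concentration_mg_per_kg": 100.0,
})
```

The concatenated 0/1 vector is the kind of metadata input that is later combined
with the CLS token, while log10_conc is the regression target.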

3.2 Architecture
The models made in this project employ two main components: a Natural Language
Processing (NLP) transformer and Deep Neural Networks (DNNs). The first type of
model built, the so-called single-DNN model, is illustrated in Figure 3.4. Here,
one ChemBERTa transformer is connected to one DNN associated with a specific data
set, such as acute toxicity.

Figure 3.4: A simplified overview of the single-DNN model. The model consists of
two parts: a ChemBERTa transformer and a Deep Neural Network (DNN). Here,
the transformer receives the SMILES strings (textual representation) of chemicals’
molecular structures as input and produces a numerical representation (CLS-token)
that is then fed to the DNN. Then, the DNN utilises the transformer output, along
with additional one-hot encoded metadata and measured log10 concentrations, to
predict log10 concentrations.

Figure 3.5 depicts an overview of the second type of model, the so-called
multiple-DNN model, used in this project. Here, the ChemBERTa transformer is instead
connected to several DNNs, each corresponding to its own data set. Moreover, these
DNNs work independently of one another, with the transformer being fine-tuned through
backpropagation of the combined loss, or error, from each of these separate DNNs.

Figure 3.5: Structure of the second model used in the project. The ChemBERTa
transformer is identical to what has previously been described for the first model,
and each of the three networks connected to this transformer also has the same
structure as the single models for each data set. Hence, the real difference between
this model and the previous lies in the fact that the three data sets can in this case
be analysed simultaneously through their own respective Deep Neural Networks
(DNN). Moreover, these networks all work independently from one another, with
the total loss (used for training) being the sum of the losses from each network.
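
The combined loss of the multiple-DNN model, described above as the sum of the
losses from each network, can be sketched in plain Python. This is a simplified
illustration that works on lists of numbers rather than tensors. The function
names, the dictionary layout, and the choice to skip a network that has no
samples in a batch are assumptions made for the example. The per-network loss is
taken to be the mean absolute error, in line with the L1 loss named in the
Training section.

```python
from typing import Dict, List, Tuple


def mae(predictions: List[float], targets: List[float]) -> float:
    """Mean absolute error (L1 loss) between predictions and targets."""
    if len(predictions) != len(targets) or not targets:
        raise ValueError("need two non-empty lists of equal length")
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(targets)


def combined_loss(batch: Dict[str, Tuple[List[float], List[float]]]) -> float:
    """Sum the per-data-set losses.

    `batch` maps a data set name to (predictions, targets) for the samples of
    that data set in the current batch. Networks without samples contribute
    nothing (an assumption of this sketch).
    """
    total = 0.0
    for predictions, targets in batch.values():
        if targets:
            total += mae(predictions, targets)
    return total


# Made-up numbers for illustration only.
batch = {
    "acute toxicity": ([1.0, 2.0], [1.5, 2.5]),
    "carcinogenicity": ([0.0], [1.0]),
    "reproductive toxicity": ([], []),
}
total = combined_loss(batch)  # 0.5 + 1.0
```

Because the networks share one transformer, backpropagating this single summed
value updates the transformer with information from all data sets at once.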

In the case of both the single- and multiple-DNN model, the input data consist of
the categorical variables "species", "administration route", and "endpoint" (also
called assay outcome), which have all been one-hot encoded, a CLS-token
representation of a SMILES, and the base-10 logarithm of the measured concentration
for the specific assay. Moreover, the transformer used is a pre-trained ChemBERTa
(see Theory), whose task is to transform the molecular structure of a chemical,
represented as a SMILES, into numerical output, the CLS token. Before being fed to
the transformer, the SMILES are tokenised by a tokeniser provided by Hugging
Face [26]. The tokeniser used here has been pre-trained on SMILES and is based on
Hugging Face's byte-pair encoding tokeniser. Finally, in the DNN, the input is passed
through several hidden layers utilising ReLU activation functions until it reaches a
single linear output neuron, with the final output being the base-10 logarithm of the
concentration for the desired toxicological outcome.

3.3 Training
In order to create the models used in the project, a pre-trained ChemBERTa
transformer was downloaded from the Transformers library on Hugging Face, whilst the
DNN was created using PyTorch v1.9.0 [27]. To train the DNN, the L1 loss function
(the Mean Absolute Error (MAE)) was used together with the stochastic gradient-based
optimiser AdamW. A number of hyperparameter sweeps were then used to set the DNN's
learning rate, dropout probability, number of hidden layers and number of hidden
neurons. These sweeps were performed using Bayesian hyperparameter search [28]. In
contrast to a grid search, where all possible combinations of hyperparameter values
are tested, the choice of parameter settings for a specific run in a Bayesian search
is influenced by information gained in previous runs (that is, which settings gave
good performance). Hence, the settings tested are the ones with the highest
probability of being good choices. A Bayesian search has no obvious endpoint
(whereas a grid search ends when all combinations have been tested), and in this
case the decision was made to end the search when around 400 runs had been
performed. All training was performed on Alvis OnDemand using A100 GPUs and logged
on the machine learning platform Weights & Biases. In addition, a linear warmup with
100 warmup steps, during which the learning rate was linearly increased from a low
level to the set level, was implemented in each training session to reduce early
learning volatility [29].
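
The linear warmup described above can be sketched as a function of the optimiser
step. This is a minimal illustration, not the scheduler used in the project: the
function name warmup_lr is hypothetical, the starting level (one hundredth of the
set learning rate) is an assumption, and so is holding the rate constant after
warmup, since the text does not state what happens after the first 100 steps.

```python
def warmup_lr(step: int, base_lr: float, warmup_steps: int = 100) -> float:
    """Learning rate at 0-indexed optimiser step `step` under linear warmup.

    The rate grows linearly from base_lr / warmup_steps to base_lr over the
    first `warmup_steps` steps and is then held at base_lr (assumption).
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    if warmup_steps <= 0 or step >= warmup_steps:
        return base_lr
    return base_lr * (step + 1) / warmup_steps


# Example with a set learning rate of 1e-5 and 100 warmup steps.
first = warmup_lr(0, 1e-5)      # one hundredth of the set rate
halfway = warmup_lr(49, 1e-5)   # half of the set rate
after = warmup_lr(100, 1e-5)    # the set rate
```

Keeping the first updates small limits how far the pre-trained transformer
weights move while the randomly initialised DNN still produces large errors.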

Furthermore, in training the model, a 10-fold cross-validation method was employed,
and a sequential sampler was used for both the training and test sets. A weighted
random sampler was also considered for the training set due to some imbalance in
the data, such as in terms of species and assay outcome. However, the weighted
random sampler did not significantly improve the model's performance, and a
sequential sampler was necessary when creating a model with multiple DNNs.
Therefore, the decision was made to use a sequential sampler. For the multiple-DNN
model, the training sets of all three data sets were concatenated, as were their
test sets, after each data set's individual test-train split (where each data set
was randomly split into 10 folds of equal size). As the acute toxicity data set is
much larger than the other two, special care was needed here to make sure that the
smaller sets were always represented in the training set. Specifically, the
carcinogenicity and reproductive toxicity training sets were upsampled in such a way
that they each would constitute 25% of the total, concatenated training set (before
upsampling). This upsampling was performed by randomly drawing, with replacement,
SMILES from the training sets. After the two training sets had been upsampled, they
were concatenated with each other and the acute toxicity training set. Finally, the
entire training set, as well as the test set, were shuffled.
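
The upsampling step can be sketched as follows, under one reading of the
description above: each small training set is grown, by drawing rows with
replacement, until its size equals 25% of the combined training set size before
upsampling. The function name, the choice to keep the original rows and add extra
draws, and the toy data set sizes are assumptions made for the example.

```python
import random
from typing import List, Sequence


def upsample_to_fraction(small: Sequence[str], total_before: int,
                         fraction: float = 0.25, seed: int = 0) -> List[str]:
    """Grow `small` with replacement to round(fraction * total_before) rows.

    The original rows are kept and the remainder is drawn at random with
    replacement (an assumption of this sketch).
    """
    if not small:
        raise ValueError("cannot upsample an empty data set")
    target = int(round(fraction * total_before))
    extra = max(0, target - len(small))
    rng = random.Random(seed)
    return list(small) + rng.choices(list(small), k=extra)


# Toy training sets whose proportions loosely mimic the real imbalance.
acute = [f"acute_{i}" for i in range(880)]
carc = [f"carc_{i}" for i in range(30)]
repro = [f"repro_{i}" for i in range(90)]
total_before = len(acute) + len(carc) + len(repro)   # 1000 rows

carc_up = upsample_to_fraction(carc, total_before, seed=1)
repro_up = upsample_to_fraction(repro, total_before, seed=2)

# Concatenate and shuffle the final training set.
train = acute + carc_up + repro_up
random.Random(3).shuffle(train)
```

Note that after upsampling the small sets make up less than 25% each of the
final training set, because the 25% target refers to the size before upsampling.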

3.3.1 K-Fold Cross-Validation
Figure 3.6 shows the k-fold cross-validation technique, which involves randomly
dividing the data set into k equal folds [30]. In this project, the data sets were
split so that there was no SMILES overlap between the test and training sets. After
the split, one fold is used as the test set, and the other k-1 folds are used for
training. The trained model is then evaluated on the test set, resulting in a
validation performance/loss for this fold. This process is repeated until all folds
have been used as the test set once, resulting in k validation performance
calculations. The number of folds used is typically 5-10, and in this project, 10
folds were used. Moreover, to reduce the potential for bias introduced during the
initial data split and fold assignment, the k-fold cross-validation was repeated 10
times for each test, and the final model performance was calculated as the average
of these performances.
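
A split with no SMILES overlap between training and test sets can be obtained by
assigning whole chemicals, rather than individual rows, to folds. The sketch below
illustrates that idea in plain Python. It is not the project's implementation:
the function names are hypothetical, and the folds are balanced by number of
unique SMILES, so the number of rows per fold may differ slightly.

```python
import random
from typing import List, Sequence, Tuple


def smiles_grouped_folds(smiles: Sequence[str], k: int = 10,
                         seed: int = 0) -> List[List[int]]:
    """Assign row indices to k folds so all rows of one SMILES share a fold."""
    unique = sorted(set(smiles))
    if k < 2 or k > len(unique):
        raise ValueError("k must be between 2 and the number of unique SMILES")
    random.Random(seed).shuffle(unique)
    fold_of = {s: i % k for i, s in enumerate(unique)}
    folds: List[List[int]] = [[] for _ in range(k)]
    for row, s in enumerate(smiles):
        folds[fold_of[s]].append(row)
    return folds


def train_test_indices(folds: List[List[int]],
                       test_fold: int) -> Tuple[List[int], List[int]]:
    """Use one fold as the test set and the remaining folds for training."""
    test = list(folds[test_fold])
    train = [i for j, fold in enumerate(folds) if j != test_fold for i in fold]
    return train, test


# Toy data in which two chemicals appear twice.
smiles = ["CCO", "CCO", "CCN", "c1ccccc1", "CC", "CCC", "CCC", "O"]
folds = smiles_grouped_folds(smiles, k=3, seed=0)
splits = [train_test_indices(folds, t) for t in range(3)]
```

Repeating the procedure with a different seed gives a new fold assignment, which
corresponds to the repeated cross-validation described above.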

Figure 3.6: Illustration of K-fold cross-validation.

3.3.2 Median Loss, Best Average Loss and LossMEAN

During training, it was important to avoid bias introduced by having the same chem-
ical appear multiple times in the test set. If a chemical appeared multiple times, it
would have a larger impact on the average loss than a chemical that only appeared
once. To address this issue, the concept of the best average loss was implemented.
To calculate the best average loss, the median loss for each unique chemical in the
validation set was calculated at each epoch. By taking the median for each unique
chemical, the output was less influenced by certain chemicals appearing often in the
data. The mean was then taken of all the median values to achieve the total average
loss for the epoch, and the lowest total average loss achieved during the entire run
was considered the best average loss.
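
The calculation described above (median loss per unique chemical, mean of those
medians per epoch, minimum over epochs) can be sketched as follows. The function
names and the toy numbers are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean, median
from typing import Dict, List, Sequence


def epoch_average_loss(smiles: Sequence[str],
                       abs_errors: Sequence[float]) -> float:
    """Mean over unique chemicals of each chemical's median absolute error."""
    per_chemical: Dict[str, List[float]] = defaultdict(list)
    for s, err in zip(smiles, abs_errors):
        per_chemical[s].append(err)
    return mean(median(errs) for errs in per_chemical.values())


def best_average_loss(smiles: Sequence[str],
                      errors_per_epoch: Sequence[Sequence[float]]) -> float:
    """Lowest per-epoch average loss reached during a run."""
    return min(epoch_average_loss(smiles, errs) for errs in errors_per_epoch)


# Toy validation set: one chemical appears three times, another once.
smiles = ["CCO", "CCO", "CCO", "CCN"]
errors_per_epoch = [
    [1.0, 2.0, 3.0, 0.5],   # epoch 1
    [0.5, 1.0, 1.5, 0.5],   # epoch 2
]
epoch_1 = epoch_average_loss(smiles, errors_per_epoch[0])
best = best_average_loss(smiles, errors_per_epoch)
```

In epoch 1 the plain mean over all rows would be 1.625, while the per-chemical
version gives 1.25, since the chemical that appears three times counts only once.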

Moreover, the lack of a reference for losses was a problem during the project.
Without a reference, it was difficult to determine what constituted a low loss
value. To address this issue, the lossMEAN value was introduced. This value
represents the loss obtained if the model were to predict the mean of the measured
concentrations of the training set for all samples. A model that only predicts this
mean value ignores its input entirely, so a loss close to lossMEAN indicates that
the model has not learned anything from the input data.
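
The reference value can be sketched as the mean absolute error of a constant
prediction. In this illustration the constant is the mean of the training
targets and the error is measured on the held-out targets; evaluating on the
held-out set is an assumption, as are the function name and the toy numbers.

```python
from statistics import mean
from typing import Sequence


def loss_mean_baseline(train_targets: Sequence[float],
                       test_targets: Sequence[float]) -> float:
    """MAE obtained by predicting the training mean for every test sample."""
    baseline = mean(train_targets)
    return mean(abs(y - baseline) for y in test_targets)


# Made-up log10 concentrations for illustration only.
train_targets = [1.0, 2.0, 3.0]    # mean is 2.0
test_targets = [2.0, 4.0]          # absolute errors are 0.0 and 2.0
reference = loss_mean_baseline(train_targets, test_targets)
```

A trained model is only informative to the extent that its loss falls below this
reference.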

4
Results

This chapter aims to provide an overview of the project results. The first part
presents the final settings for the model architecture. The next part focuses on
the model's performance results, presented in the order of acute toxicity,
carcinogenicity, and reproductive toxicity. As two models were used, one with a
single deep neural network and one with several deep neural networks connected to a
single transformer, a comparison of their performance is also included.

4.1 Architecture
To decide the project's model structures, hyperparameter sweeps were conducted,
which tested combinations of different parameter values of interest. Initially,
acute toxicity data were used to perform the hyperparameter sweeps in a single-DNN
model setting. Later, the process was repeated using carcinogenicity data, yielding
the same results as the first case. Time constraints prevented additional
hyperparameter sweeps for the expanded (multiple-DNN) model or reproductive toxicity
data, so the parameter settings found in the initial sweeps were used for all models
in the project. For more information on the methods used, please refer to the
Methods section.

Table 4.1 shows the parameters and values that were tested in the hyperparameter
sweeps. The number of hidden layers in the model was determined by testing one, two,
and three layers, and the case that resulted in the best outcome (three hidden
layers) was chosen. Some parameters were tested using ranges of values, which was
possible due to the Bayesian approach used in the sweeps (see Theory for more
details). This allowed combinations of parameters to be tested until a stopping
point was reached (around 400 combinations tested for each sweep).

Table 4.1: Parameter Sweep Configuration Summary

Parameter              Tested Values          Chosen Values
Learning Rate          1*10^-2 - 2*10^-6      1*10^-5
Dropout                0.2, 0.3               0.3
Hidden Layers Size 1   100-900                300
Hidden Layers Size 2   100-900                500
Hidden Layers Size 3   100-900                700

Instead of using hyperparameter sweeps, the batch size and number of epochs were
determined based on previous knowledge and trial and error. The number of epochs
was increased up to 80, since the loss continued to decrease until that point, while
the batch size was set to 256, a commonly used value that provided good performance.
Freezing some encoder layers of the transformer was also tested but did not improve
performance. To prevent early overfitting and hasten convergence, a warm-up period
of 100 steps, selected through trial and error, was added through a linear scheduler
(see Methods). The final parameter configurations used for all models in the project
are available in Table 4.2.

Table 4.2: Final Parameter Settings

Parameter              Values
Learning Rate          1*10^-5
Batch Size             256
Number of Epochs       80
Dropout                0.3
Hidden Layers Size 1   300
Hidden Layers Size 2   500
Hidden Layers Size 3   700

4.2 Model Performance: 10-Fold Cross-Validation
Presented in this section are the results of the 10-fold cross-validations used to
evaluate the model's performance in both single- and multiple-DNN forms. As
mentioned in Methods, the cross-validations were repeated 10 times, and each split
was associated with a unique random seed to reduce the potential for bias from any
single split. Accordingly, the tables and figures presented show the fold average
over the 10 runs for each fold, and the values shown represent the final losses
achieved at the end of a run after 80 epochs of training. The upcoming subsections
analyse the model's performance on, and results for, each in vivo data set
separately, starting with acute toxicity, followed by carcinogenicity and
reproductive toxicity. The analysis also compares the outputs of the single- and
multiple-DNN models for each case. The acute toxicity data set is subjected to a
more comprehensive analysis. For the other two data sets, a detailed analysis is not
performed due to time and length constraints, on the assumption that effects similar
to those seen for the acute toxicity data also affect their results (however, see
Appendix 1 for some of the results related to the latter two data sets).
Table 4.3 lists the total fold average of the median, best average, and lossMEAN (see
Methods) for each of the data sets acute toxicity, carcinogenicity, and reproductive
toxicity. The results are shown separately for the single- and multiple-DNN models.
It can be seen that the best average loss is consistently lower than the median
loss, and that the model performs better overall on the acute toxicity data set than
on the other two data sets. Furthermore, the multiple-DNN model performs worse than
the single-DNN model in all cases.

Table 4.3: Loss Comparison Between Analysed Data Sets in Single- and Multiple-DNN
Models

Fold Average Measurement   Model    Acute toxicity   Carcinogenicity   Reprotoxicity
Median Loss                Single   0.273            0.543             0.688
Median Loss                Multi    0.308            0.610             0.763
Best Average Loss          Single   0.238            0.455             0.60
Best Average Loss          Multi    0.260            0.513             0.627
LossMEAN                   Single   0.783            0.955             1.076
LossMEAN                   Multi    0.783            0.955             1.0882

4.3 Acute Toxicity Data Set
4.3.1 Model Performance

Figure 4.1 shows the average training loss curve (in Log10[mg/kg bw]) for each of
the 10 folds in the 10-fold cross-validation of the acute toxicity data set, plotted
against the step count of the AdamW optimizer, for a) the single- and b) the
multiple-DNN model. For both a) and b), a continuous decrease in loss is observed
throughout each run, corresponding to the model learning. Overall, the two curves
look similar, though the loss appears slightly higher for the multiple-DNN model. In
both cases, the loss still appears to be decreasing at the end of each run,
indicating that there is still more to learn from the training data at this point.
The figure also indicates that there are considerable differences in output within
the folds (seen by the large error margins), but that the averages are almost the
same for all 10 folds. At the end of each fold's run, the lowest average training
loss, about 0.3, is reached in both a) and b).

a) Training loss average for each fold in 10-fold cross-validation for acute toxicity
data in single-DNN model.

b) Training loss average for each fold in 10-fold cross-validation for acute toxicity
data in multiple-DNN model.

Figure 4.1: Training loss average for a) single- and b) multiple-DNN models when
analysing the acute toxicity data set.

In Figure 4.2, the average median loss, in blue, and the average best average loss,
in yellow, for each fold (achieved at the end of each run) have been plotted in a
bar plot together with the average lossMEAN reference value, in green, for each of
the 10 folds. The results from both the single- and multiple-DNN model have been
plotted together in such a way that the leftmost bar for each loss value corresponds
to the single-DNN model, whilst the rightmost bar corresponds to the multiple-DNN
model. It can be seen that the average values do not change much between folds, with
the lossMEAN staying around 0.8 in both cases. Moreover, the median and best average
losses are much lower than the reference lossMEAN value, being at around 0.3 and
0.25 respectively for both the single- and multiple-DNN case, indicating that the
model is successfully finding patterns in the data. Notably, the best average loss
values are lower than the median loss values for each fold.

Figure 4.2: Validation best average loss, yellow, median loss, blue, and lossMEAN ,
green, average over 10 folds for the acute toxicity data set, with values of single-DNN
model to the left in each case, and multiple-DNN model to the right.

4.3.2 Model Results
Illustration a) in Figure 4.3 shows the residuals from the 10-fold (repeated 10 times)
cross-validation of the acute toxicity data set, whilst illustration b) instead depicts
the predicted concentrations plotted against the measured concentrations found in
the data, all in Log10[mg/kg bw]. In both cases, the values correspond to the
median result for each SMILES in the data set, and in the second image, the red
line corresponds to a linear fit to the data, whilst the black line is the perfect one-to-
one fit. Moreover, the leftmost image in both cases corresponds to the single-DNN
model, whilst the rightmost corresponds to the multiple-DNN model. For both the
single- and multiple-DNN model, the residual plot, a), shows that most residuals
lie relatively close to 0, except for a few notable chemicals with significantly larger
residuals than the rest. Here, the result is strikingly similar between the two models,
indicating that the models operate in much the same manner. For b), it can be seen
that both models tend to overestimate very low concentrations, predicting much
higher concentrations for these cases than were measured.

a) Median validation residuals in Log10[mg/kg bw] for each SMILES.

b) Median validation predicted versus measured concentrations in Log10[mg/kg bw]
for each SMILES.

Figure 4.3: Median validation residuals and predictions versus measured concen-
trations for each SMILES in 10-fold cross-validation for acute toxicity data, with
the left image corresponding to the single- and the right to the multiple-DNN model
in both cases.

Table 4.4 contains the top five worst-performing chemicals in the acute toxicity data
set, with their frequency of occurrence, mean concentration, predicted concentration,
and absolute error in log10 scale. Here, the upper part of the table represents the
results obtained from the single-DNN model, while the lower part represents those
from the multiple-DNN model, with the same five chemicals appearing in both
cases, albeit in a different order. The chemical names were identified by searching
the corresponding SMILES on the PubChem database [31]. The table shows that
the measured concentrations for these chemicals were much lower than the average,
which may indicate that small doses of these chemicals were sufficient to produce the
desired effect. However, the predicted concentrations for all these chemicals were
much higher than the actual measured values, resulting in high absolute errors.

Table 4.4: Top Five Chemicals With Largest Median Residuals in Acute Toxicity
Data Set

Single-DNN Model
Name                Measured Conc.   Predicted Conc.   Absolute Error
rizatriptan                  -8.00              1.59             9.59
JWH-015                      -5.69              2.29             7.60
14-Methoxymetopon            -7.74             -0.36             6.59
SA4503                       -4.64              1.97             6.57
Beta-Carotene                -5.30             0.997             6.30

Multiple-DNN Model
Name                Measured Conc.   Predicted Conc.   Absolute Error
rizatriptan                  -8.00              2.14            10.14
JWH-015                      -5.69              2.47             7.85
Beta-Carotene                -5.30              1.75             7.05
SA4503                       -4.64              2.23             6.85
14-Methoxymetopon            -7.74             -0.21             6.45

The corresponding table for the top five best-performing chemicals in the acute
toxicity data set is Table 4.5. Notably, none of these chemicals except carbonic
acid has a common name, and they are therefore mostly identified by their CAS
numbers. As in the previous table, the upper part of the table represents the
results obtained from the single-DNN model, while the lower part represents those
from the multiple-DNN model. Also notable, compared to the previous table, is that
all measured concentrations are much higher here, and that the two models do not
share any of their best-performing chemicals.

Table 4.5: Top Five Chemicals With Smallest Median Residuals in Acute Toxicity
Data Set

Single-DNN Model
Name            Measured Conc.   Predicted Conc.   Absolute Error
103687-05-4               2.83              2.83            0.010
52582-90-8                2.18              2.18            0.011
15913-41-4                2.40              2.39            0.012
115091-87-7               2.39              2.38            0.014
73927-34-1                2.88              2.88            0.014

Multiple-DNN Model
Name            Measured Conc.   Predicted Conc.   Absolute Error
88770-63-2                1.34              1.34            0.009
52994-61-3                2.30              2.30            0.012
197039                    0.67              0.66            0.014
23905-05-7                2.90              2.89            0.016
Carbonic Acid             2.18              2.17            0.016

In Figure 4.4, all predictions in the entire 10-fold cross-validation (repeated 10
times) for a) the top five worst- and b) the top five best-performing chemicals have
been plotted against their measured concentrations in the data and coloured
according to which chemical the points belong to, with the left plot in the figure
corresponding to the single- and the right to the multiple-DNN model. For the
worst-performing chemicals, in a), there is large within-chemical variation in the
predictions, as well as variation between the different measured concentrations
(which are all relatively low). Moreover, only 14-Methoxymetopon, SA4503 and JWH-015
have been measured more than once, with measured concentrations showing a large
variation in value. In contrast, the best-performing chemicals, in b), only occur
once in the data set, and all have measured concentrations between 0.5 and
3 Log10[mg/kg bw].

a) Measured vs predicted Log10[mg/kg bw] concentrations for top five worst per-
forming chemicals.

b) Measured vs predicted log10(concentrations) for top five best performing chemi-
cals.

Figure 4.4: Measured vs predicted Log10[mg/kg bw] concentrations for a) the top
five worst- and b) the top 5 best-performing chemicals, with the left plot correspond-
ing to the single- and the right to the multiple-DNN model.

Finally, a Principal Component Analysis (PCA) was also performed on the CLS tokens
(corresponding to unique SMILES) in the validation set data. Figure 4.5 shows the
first two components plotted against each other from a PCA done on one fold of the
validation data, with the left plot in the figure corresponding to the single- and
the right to the multiple-DNN model. For both models, the first component (on the
x-axis) captures around 23% of the variation in the data, whilst the second
component (on the y-axis) captures around 15% for the single- and around 13% for the
multiple-DNN model. The PCA plots have been coloured by the median measured
concentration for each SMILES. For both models, the measured concentration seems to
vary along a gradient which largely corresponds to the separation captured mostly by
the first but also partly by the second principal component. It can be observed that
SMILES associated with higher measured concentrations (indicated by yellow) lie
further to the left in the plots, whilst SMILES associated with lower
concentrations, in light blue/purple, instead lie further up and to the right.

Figure 4.5: Principal component analysis of CLS tokens from one fold in 10-fold
cross-validation of acute toxicity data set, with the left plot corresponding to the
single- and the right to the multiple-DNN model.

4.4 Carcinogenicity Data Set
4.4.1 Model Performance
In Figure 4.6, as in Figure 4.1 above, the average training loss (in log10
concentration) for each fold in the carcinogenicity data set has been plotted
against the steps taken by the optimiser, with a) corresponding to the single- and
b) to the multiple-DNN model. As in the case of acute toxicity, the fold average
losses lie close to each other for both models, whilst the large shaded areas
indicate that there are considerable differences in output within the folds
themselves. However, in this case, the multiple-DNN model seems to show more
variation between folds, with some average fold losses even increasing before
decreasing with the rest. Also, as observed for the acute toxicity case, the average
losses, which reach an overall minimum of around 0.3-0.4 for both models, are still
decreasing slightly at the end of the runs, indicating that there is still more to
learn in the data.

a) Training loss average for each fold in 10-fold cross-validation for carcinogenicity
data in single-DNN model.

b) Training loss average for each fold in 10-fold cross-validation for carcinogenicity
data in multiple-DNN model.

Figure 4.6: Training loss average for a) single- and b) multiple-DNN models when
analysing the carcinogenicity data set.

As in Figure 4.2 for acute toxicity, Figure 4.7 below depicts a bar plot of the
average median loss, in blue, and best average loss, in yellow, plotted together with
the average reference lossMEAN, in green, and their respective standard deviations,
for each fold in the carcinogenicity data. The leftmost bar for each value
corresponds to the single- and the rightmost to the multiple-DNN model. Compared
to the acute toxicity data set, there seems to be slightly more variation in output
between the different folds for both models. As in the case of acute toxicity,
however, the best average loss for each fold is in all cases lower than the median
loss.

Figure 4.7: Validation best average loss, yellow, median loss, blue, and lossMEAN ,
green, average with standard deviation over 10 folds for the carcinogenicity data set,
with values of single-DNN model to the left in each case, and multiple-DNN model
to the right.

4.4.2 Model Results

Figure 4.8 shows a) the residuals and b) the predictions plotted against measured
concentration for the single-DNN model, to the left, and the multiple-DNN model, to
the right. In both cases, values are taken as the median for each SMILES and given in
log10 concentration. As for the acute toxicity data set, it can be seen in a) that most
residuals lie relatively close to 0, with a few chemicals being predicted much worse
than the rest. On the other hand, unlike the acute toxicity case, b) indicates that the
carcinogenicity models have a less obvious tendency to predict high concentrations
for chemicals with the lowest measured concentrations. However, there are still cases
where the models have predicted much lower or higher concentrations than the
measured values. In both a) and b), the single- and multiple-DNN models seem to
perform very similarly.
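
The per-SMILES aggregation used for these plots can be sketched as below. This is a minimal illustration rather than the code used in the project: the function name `median_per_smiles`, the record layout and the toy values are assumptions, and the residual is taken here as predicted minus measured log10 concentration, a sign convention that the text does not state.

```python
from collections import defaultdict
from statistics import median


def median_per_smiles(records):
    """Collapse repeated validation rows to one median value per SMILES.

    `records` holds (smiles, measured_log10, predicted_log10) tuples.
    This layout is hypothetical and chosen only for the sketch.
    """
    grouped = defaultdict(lambda: {"measured": [], "predicted": [], "residual": []})
    for smiles, y_true, y_pred in records:
        grouped[smiles]["measured"].append(y_true)
        grouped[smiles]["predicted"].append(y_pred)
        # Assumed convention: positive residual = predicted dose too high.
        grouped[smiles]["residual"].append(y_pred - y_true)
    return {
        smiles: {key: median(values) for key, values in columns.items()}
        for smiles, columns in grouped.items()
    }


# Toy rows in log10[mg/kg bw], invented for illustration.
rows = [
    ("CCO", 3.0, 2.8),
    ("CCO", 3.4, 2.8),
    ("CCO", 3.2, 2.8),
    ("c1ccccc1", 1.0, 2.5),
]
summary = median_per_smiles(rows)
```

Each SMILES then contributes a single point to the residual plot and a single point to the predicted-versus-measured plot, regardless of how many measurements it has.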

a) Median validation residuals in Log10[mg/kg bw] for each SMILES.

b) Median validation predicted versus measured concentrations in Log10[mg/kg bw]
for each SMILES.

Figure 4.8: Median validation residuals and predictions versus measured concen-
trations for each SMILES in 10-fold cross-validation for carcinogenicity data, with
the left image corresponding to the single- and the right to the multiple-DNN model
in both cases.

4.5 Reprotoxicity Data Set
4.5.1 Model Performance
The reproductive toxicity data set was analysed similarly to the previous two data
sets, with the average training loss for the data set plotted against the AdamW
optimiser step count in Figure 4.9 for a) the single- and b) the multiple-DNN model.
Here, as for the other two data sets, the average fold training loss for all folds was
similar, but with some large within-fold variations in the data, as indicated by the
shaded areas. Unlike for the other two data sets, here the training loss did not show
a significant decrease at the end of the runs, indicating that the training might have
been more complete in this case. In the figure, it can be seen that the final average
training losses for this data set were around 0.2-0.3 for a) and around 0.4 for b).

a) Average training loss over each fold in the single-DNN model.

b) Average training loss over each fold in the multiple-DNN model.

Figure 4.9: Average training loss over each fold in 10-fold cross-validation for in
vivo reproductive toxicity data in a) the single- and b) the multiple-DNN model.

As for the previous two data sets, Figure 4.10 depicts the fold average median loss,
in blue, and the best average loss, in yellow, plotted together with the average
reference lossMEAN, in green, and their respective standard deviations over the folds
of the reproductive toxicity data. As before, the leftmost bar for each value
corresponds to the single- and the rightmost to the multiple-DNN model. Moreover, as
noted for the carcinogenicity data above, even though the fold averages have similar
values, there is more variation between the fold averages than in the acute toxicity
data set. Also, as pointed out in the acute toxicity and carcinogenicity cases, the best
average loss for each fold is much lower than the median loss.

Figure 4.10: Validation best average loss, yellow, median loss, blue, and lossMEAN,
green, average with standard deviation over 10 folds for the reproductive toxicity
data set, with values of single-DNN model to the left in each case, and multiple-DNN
model to the right.

4.5.2 Model Results

For the reproductive toxicity data set, Figure 4.11 gives an overview of model
performance, with a) the residuals and b) the predictions against measured
concentrations, calculated as medians for each SMILES and in log10 concentration, for
the single-DNN model, to the left, and the multiple-DNN model, to the right. As for
the previous two data sets, a) shows that a handful of chemicals in the data set have
much higher residuals than the rest, with the results for the single- and multiple-DNN
models looking very similar. Moreover, as for the acute toxicity data set (and to some
degree the carcinogenicity data set), b) indicates that both models tend to be too
lenient on chemicals associated with very low measured doses, that is, to predict
higher doses than were measured, as indicated by the large number of chemicals lying
far to the left in the plots. In b), it can also be seen that the multiple-DNN model
seems to predict slightly higher values than the single-DNN model.

a) Median validation residuals in Log10[mg/kg bw] for each SMILES.

b) Median validation predicted versus measured concentrations in Log10[mg/kg bw]
for each SMILES.

Figure 4.11: Median validation residuals and predictions versus measured concen-
trations for each SMILES in 10-fold cross-validation for reproductive toxicity data,
with the left image corresponding to the single- and the right to the multiple-DNN
model in both cases.

5
Discussion

As hoped for, the single- and multiple-DNN models developed in this project
demonstrate the ability to predict different toxic effects by utilising the structural
information of chemicals. Here, the first principal components of the transformer
output indicate that the transformer is successful in identifying patterns in the
chemicals’ structures that correspond to their toxicity. However, both models tend to
be too lenient with data associated with low concentrations, as seen in the residual
and prediction analysis. Moreover, the single-DNN model consistently performs better
than the multiple-DNN model, possibly due to the added complexity of the latter’s
task. To address these issues, one suggestion is to make the loss more stringent for
chemicals with low measured concentrations and to perform further hyperparameter
sweeps to optimise the multiple-DNN model.

5.1 Model Performance
The training loss curves (Figures 4.1, 4.6, and 4.9) provide an initial indication of
model performance by showing whether the model can learn from the data. The training
loss decreases for all models and data sets analysed in the project, although some of
the plots suggest that the model is still learning at the end of the runs. Longer
training, say for 100 epochs instead of 80, might therefore improve performance, but
this depends on whether the training data accurately represents the test data.
Notably, the average losses in each curve and fold are similar for all cases,
indicating that 10-fold cross-validation successfully minimises fold bias, although
there are large within-fold variations in the data, as shown by the large error
margins. Moving on to the validation data, Table 4.3 indicates that the average
performance is best for the acute toxicity data set, as also observed in Figures 4.2,
4.7, and 4.10. This is most likely at least in part because the acute toxicity data
set accounts for around 90% of all in vivo data in the study and contains the majority
of the unique chemicals, as illustrated in Figures 3.1 and 3.2 in the Methods section.
With much more data in the acute toxicity data set than in the other two, there is
far more information for the model to train on (as the training set is made up of 90%
of the data), giving the model many more opportunities to become familiar with
different structures.

However, the size of the data sets cannot solely explain the differences in
performance observed among the models: although the reproductive toxicity data set is
more than twice as large as the carcinogenicity data set, it still performs worse. The
final hyperparameter settings were determined after a set of sweeps, see Tables 4.1
and 4.2, performed only on the acute toxicity and carcinogenicity data sets, meaning
that the settings are not necessarily well suited to the reproductive toxicity data
set. Potentially, this could at least partially explain the low performance on this
data set. On the other hand, when the hyperparameter sweep was performed on the
carcinogenicity data set, parameter settings were observed to have only a marginal
effect on model performance. In light of this, there is no clear reason why
hyperparameter settings would play a large role in model performance for the
reproductive toxicity data set, even though this possibility cannot be dismissed.
Instead, one possible reason why the reproductive toxicity data set performs worse
than the carcinogenicity data set, despite being larger, could be that the former is
inherently more difficult to predict due to the complexity of the data. For example,
it can be challenging to define what constitutes a LOEC value for reproductive
toxicity, whereas, for acute toxicity and carcinogenicity, the answer is clearer.
Overall, effects on reproduction and offspring can have many explanations, making it
difficult to define what reproductive toxicity really constitutes. This difficulty
might, for example, result in noisier data, leading to difficulties in prediction.
Hence, to improve the model for this type of data, it might be a good idea to separate
the data into different groups and be more selective in the pre-processing step.

In analysing Table 4.3 and Figures 4.2, 4.7, and 4.10, it becomes evident that
the multiple-DNN model consistently performs worse than the single-DNN models.
Here, the main difference between the models is the number of DNNs connected to
the transformer used for interpreting the SMILES. In the single-DNN cases, the
transformer will prioritise the parts of the SMILES/chemical structures that improve
performance for the specific data set the model was built for, whereas the idea behind
the multiple-DNN model was to use one transformer to find patterns in the data that
improve SMILES translation for all networks. However, by having three separate
DNNs connected to the same transformer, it is possible that the transformer became
worse at translating SMILES in a way that is beneficial for each separate data set,
as it was trained on conflicting signals from all three DNNs simultaneously.
For example, a chemical which is carcinogenic might not be at all acutely toxic. As
seen in the Venn diagram in Figure 3.2, only very few chemicals occur in more than
one data set. Notably, the diagram does not indicate how large the overlap is in
structural elements of chemicals, which is potentially what might have a larger
effect on the transformer performance. Possibly, the transformer embedding size used
for the larger, multiple-DNN model could be increased to try to improve the
transformer’s ability to handle and adjust to the different problems at hand.

Additionally, one reason why the multiple-DNN model performs worse than the
single-DNN model could be the data set sizes and the sampling method used. In
the implementation of the multiple-DNN model, the difference in size between the
concatenated acute toxicity, carcinogenicity, and reproductive toxicity data sets
became an issue. As the acute toxicity data set is much larger than the other two,
drawing random rows for training batches would largely exclude data from the smaller
sets and make training any network except the acute toxicity network practically
impossible, thereby breaking the model. To circumvent this problem, the smaller
data sets were upsampled to ensure that a certain percentage of each training batch
consisted of data from the other two data sets, see Methods for more details.
During training, fold-dependence was observed in the outputs, see Figures 4.2, 4.7
and 4.10, particularly for the smaller data sets. As this fold-dependence was not
that noticeable for the acute toxicity data set, but more pronounced for the two
smaller data sets, the implication is that the latter’s smaller training and
validation sets could be a drawback for the performance of the multiple-DNN model
(and make the output unstable for these cases). As the amount of upsampling
required for these data sets seems to be a critical factor that affects the model’s
performance, the best way to determine the optimal upsampling percentage might
be through a hyperparameter sweep.
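
The idea of fixing the share of each data set in every training batch can be sketched as below. This is a hedged illustration, not the sampling code of the project: the function `sample_mixed_batch`, the data set sizes and the 60/20/20 split are all assumptions made for the example, and the actual percentages are those given in the Methods chapter.

```python
import random


def sample_mixed_batch(datasets, fractions, batch_size, rng):
    """Draw one training batch with a fixed share of rows from each data set.

    Rows are drawn with replacement, so small data sets are upsampled and
    appear in every batch regardless of their size.
    """
    if abs(sum(fractions.values()) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    names = list(datasets)
    remaining = batch_size
    batch = []
    for position, name in enumerate(names):
        if position == len(names) - 1:
            count = remaining  # last set absorbs any rounding difference
        else:
            count = min(round(fractions[name] * batch_size), remaining)
        remaining -= count
        rows = rng.choices(datasets[name], k=count)  # sampling with replacement
        batch.extend((name, row) for row in rows)
    rng.shuffle(batch)
    return batch


# Hypothetical data set sizes and batch composition, for illustration only.
datasets = {
    "acute": list(range(1000)),
    "carcinogenicity": list(range(20)),
    "reproductive": list(range(50)),
}
fractions = {"acute": 0.6, "carcinogenicity": 0.2, "reproductive": 0.2}
batch = sample_mixed_batch(datasets, fractions, batch_size=50, rng=random.Random(0))
per_set = {name: sum(1 for source, _ in batch if source == name) for name in datasets}
```

In a sweep, the entries of `fractions` would be the quantities to vary, since they control how often rows from the smaller sets are repeated.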

Overall, the absence of a hyperparameter sweep on the multiple-DNN model could
have a significant impact on its performance. Since the task for the multiple-DNN
model is distinctly different from that of the single-DNN model, the latter’s
architecture may not be suitable for the former. However, the logistics of designing
a hyperparameter sweep for the multiple-DNN model present a significant challenge,
as it includes the need to determine the architecture of three DNNs simultaneously.
The extensive parameter settings and combinations to be tested would
make this a resource-intensive and time-consuming process. Consequently, future
hyperparameter sweeps for this model must be planned carefully to ensure that no
unnecessary steps are taken. As a last note, Table 4.3 and Figures 4.2, 4.7, and 4.10
reveal that the best average loss is consistently lower than the median loss average
for each fold. This outcome is not unexpected, as the best average loss accounts
for within-SMILES variation (that is, the same SMILES occurring several times
in the validation data), making it a more accurate reflection of the model’s
performance.

5.2 Result Analysis
Moving on to the model outputs, that is, the residuals and predictions, see
Figures 4.3, 4.8, and 4.11, it is first noted that the outputs from the single- and
multiple-DNN models are very similar for each data set, indicating that the models
work similarly. Furthermore, the models seem to have a tendency to predict
too high concentrations for chemicals with very low measured concentrations. The
reason for this tendency is unclear, but it could be the result of an imbalance
between the training and test data, where, for example, the training data contains
very few chemicals with low concentrations. It is not unthinkable that chemicals
associated with low measured concentrations represent more hazardous compounds,
meaning that this leniency in the models could become an issue if they were ever
to be used in practice. Hence, some effort should probably be made to mitigate
this leniency. For example, one way of increasing the "importance" of
low-concentration chemicals would be to make the loss harsher for these compounds.
Increasing the number of chemicals with low measured concentrations in the training
set, or giving them a higher weight, could also be considered. However, as a final
note on this topic, it can be pointed out that some of the chemicals with very low
concentrations, the outliers in the plots mentioned above, have very low measured
concentrations relative to most other compounds in the data sets. Hence, it is not
unlikely, as will be discussed again later, that these measurements are a result of
errors in the data, reflecting the large uncertainty and reliability issues found
when dealing with large sets of data.
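
One way to make the loss harsher for low-concentration chemicals is to weight their errors more heavily. The sketch below assumes an absolute-error loss on log10 concentrations; the function `weighted_l1_loss`, the threshold of 0 (that is, 1 mg/kg bw) and the weight of 3 are hypothetical choices for illustration and were not evaluated in this project.

```python
def weighted_l1_loss(predicted, measured, threshold=0.0, low_weight=3.0):
    """Weighted mean absolute error in log10 units.

    Rows whose measured log10 concentration lies below `threshold` count
    `low_weight` times as much as the remaining rows.
    """
    if len(predicted) != len(measured) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    weight_sum = 0.0
    for y_pred, y_true in zip(predicted, measured):
        weight = low_weight if y_true < threshold else 1.0
        total += weight * abs(y_pred - y_true)
        weight_sum += weight
    return total / weight_sum


# Toy values: the second chemical has a very low measured dose and is
# predicted far too high, which the weighted loss penalises more strongly.
predicted = [2.0, 1.0]
measured = [2.5, -1.0]
plain = weighted_l1_loss(predicted, measured, low_weight=1.0)
harsh = weighted_l1_loss(predicted, measured, low_weight=3.0)
```

With equal weights the large error on the low-dose chemical is averaged with the small one, whereas the weighted variant lets it dominate the loss, which is the intended effect. Suitable values for the threshold and the weight would have to be found empirically.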

A more detailed analysis was performed on the results of the acute toxicity data
set. For both the single- and multiple-DNN model, the top five worst-performing
chemicals were identified and listed in Table 4.4, where it was found that all of
these chemicals had very low median measured concentrations. Interestingly,
Beta-Carotene, which is known to occur in carrots and to have very low toxicity, was
one of the chemicals. Figure 4.4, which in a) shows all the predicted and measured
concentrations of the worst-performing chemicals for both models, reveals that only
three chemicals, 14-Methoxymetopon, SA4503 and JWH-015, are associated with several
different measured concentrations. Here, it is evident that there is a large variation
between the measured concentrations associated with these chemicals, which could
potentially lead to difficulty in their prediction, as well as indicate some error in
their measurements. As there is a lack of replicates for the other chemicals, the
model evaluation for these chemicals is highly sensitive to the accuracy of a single
measurement. Hence, once again leading back to the data reliability issue, it is
possible that an error in these measurements could be the root cause of the large
residuals associated with these chemicals.
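
A simple check of how many replicates each chemical has, and how far they spread, can be sketched as follows. The helper `replicate_summary` and all names and values are hypothetical and do not reproduce the data in Table 4.4 or Figure 4.4.

```python
from collections import defaultdict


def replicate_summary(measurements):
    """Per chemical: number of measurements and their spread in log10 units."""
    grouped = defaultdict(list)
    for name, log10_value in measurements:
        grouped[name].append(log10_value)
    return {
        name: {"n": len(values), "spread": max(values) - min(values)}
        for name, values in grouped.items()
    }


# Invented measurements in log10[mg/kg bw] for two placeholder chemicals.
toy = [
    ("chemical A", -2.0),
    ("chemical A", 1.0),
    ("chemical A", 0.5),
    ("chemical B", 0.25),
]
stats = replicate_summary(toy)
# Chemicals measured only once cannot be cross-checked against replicates.
unreplicated = [name for name, s in stats.items() if s["n"] == 1]
```

A large spread, such as three orders of magnitude for the first placeholder chemical, would point to inconsistent measurements, while an entry in `unreplicated` marks a chemical whose residual rests on a single value.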

Finally, the principal component analysis of the CLS tokens in the acute toxicity
data, see Figure 4.5, shows some separation along the first two components which
seems to correspond to the measured concentrations of the data. This would indicate
that the transformer itself is able to make the distinction between chemicals
necessary to determine whether a chemical is associated with a high or low
concentration. Notably, when comparing the single- and multiple-DNN model PCA plots
with each other, the former seems to capture the variation slightly better, possibly
as an effect of the former performing slightly better than the latter model.

6
Conclusion and Future Work

In this study, the aim was to develop and assess transformer-based AI models for
predicting toxicity in mammalian in vivo assay data. To do this, two model types, a
single-DNN model and a multiple-DNN model, were designed and evaluated. Here,
the validation set used to evaluate the models contained SMILES not found in the
training set, enabling the models’ capacity to handle previously unseen data to be
measured to some extent. However, the focus of the study was primarily on assessing
the models’ ability to predict different types of chemical hazards, both separately in
the single-DNN model, and later in unison in the multiple-DNN model.
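
A split in which no validation SMILES occurs in the training set can be obtained by assigning whole SMILES, rather than individual rows, to folds. The sketch below shows one such scheme under assumptions: the function `smiles_grouped_folds` and the toy data are hypothetical, and this section does not describe how the split was actually implemented.

```python
import random


def smiles_grouped_folds(smiles_per_row, n_folds=10, seed=0):
    """Build (train_indices, validation_indices) pairs grouped by SMILES.

    Every unique SMILES is assigned to exactly one fold, so all rows that
    share a SMILES end up on the same side of each split.
    """
    unique = sorted(set(smiles_per_row))
    random.Random(seed).shuffle(unique)
    fold_of = {smiles: i % n_folds for i, smiles in enumerate(unique)}
    folds = []
    for k in range(n_folds):
        train = [i for i, s in enumerate(smiles_per_row) if fold_of[s] != k]
        val = [i for i, s in enumerate(smiles_per_row) if fold_of[s] == k]
        folds.append((train, val))
    return folds


# Toy data: 60 rows covering 25 distinct straight-chain alkanes, so that
# several rows share the same SMILES string.
rows = ["C" * (i % 25 + 1) for i in range(60)]
folds = smiles_grouped_folds(rows, n_folds=10)
```

Because folds are defined on unique SMILES, replicated measurements of one chemical can never leak from the training set into the validation set.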

The results showed that both models achieved median losses ranging from 0.3 to 0.6
in logarithmic scale, translating to a median error of a factor of roughly 2-4 in
linear scale. Additionally, PCA visualisation showed that both models successfully
identified patterns in SMILES structures related to measured concentrations of
corresponding chemicals. However, throughout all experiments, it also became evident
that the multiple-DNN model performed slightly worse than the single-DNN model. This
was unexpected, as the transformer had been anticipated to accommodate several DNNs
through fine-tuning. One explanation for this finding could be that, due to time
limitations and logistical challenges, the multiple-DNN model’s architecture was never
evaluated using hyperparameter sweeps. For potential future implementations of
the model, performing a hyperparameter sweep to determine its architecture could
therefore be a priority. Moreover, the degree of upsampling needed for the smaller
data sets and the embedding size of the transformer could also be important factors
to investigate more closely in the future. Both models tended to overestimate
concentrations of compounds associated with lower measured concentrations, indicating
a potential problem if the models were ever to be used in practice, as these
probably correspond to more dangerous chemicals. Hence, in future adjustments of
the models, mitigating this problem, for example by setting a harsher loss on these
low-concentration chemicals, should be a priority. Finally, future versions of the
model could incorporate additional data sets, including in vitro toxicological assay
data, to reduce dependence on in vivo animal test data.
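
The conversion from a loss in logarithmic scale to an error in linear scale can be checked with a short calculation. It assumes that the loss is an absolute difference between predicted and measured log10 concentrations, so that a loss of d corresponds to a ratio of 10^d between the two concentrations; the helper name `fold_error` is hypothetical.

```python
def fold_error(log10_abs_error):
    """Ratio between predicted and measured dose for a given log10 error."""
    return 10.0 ** log10_abs_error


low = fold_error(0.3)   # lower end of the reported median losses
high = fold_error(0.6)  # upper end of the reported median losses
print(f"median fold error between {low:.2f} and {high:.2f}")
```

Under this assumption, a median loss of 0.3 means that half of the predictions lie within about a factor of 2 of the measured value, and a median loss of 0.6 within about a factor of 4.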

To conclude this report, one can ask if this project has fulfilled the promise implied
by its title. Here, undoubtedly, models have been introduced that possess significant
potential to serve as alternatives to in vivo animal testing in the future. However,
it is important to acknowledge that further research and development are neces-
sary before these methods can attain widespread adoption and effectively compete
with conventional testing approaches on a larger scale. Nonetheless, the numerous
challenges associated with in vivo animal testing, as highlighted in the report’s
introduction, make its eventual replacement inevitable. Furthermore, in light of the
world’s increasing digitalisation and the compelling advantages in terms of cost and
time efficiency offered by these methods, it is not a matter of if, but rather when
computer-based approaches will become the new standard. In this context, it is ev-
ident that artificial intelligence (AI), with its capacity to learn and process the vast
amounts of data generated by contemporary society, will undoubtedly constitute the
cornerstone of these computer-based methods. Moreover, the successful utilisation
of transformer-based AI models in predicting chemical toxicity, as demonstrated
by this project, establishes a crucial foundation for future testing methodologies.
Given the intelligence, affordability, and reliability of this technology, it unquestion-
ably represents an important initial step towards shaping the testing methods of
tomorrow.

Bibliography

[1] Yang J, Zhang Y. NCRF++: An Open-source Neural Sequence Labeling
Toolkit. In: Proceedings of the 56th Annual Meeting of the Association
for Computational Linguistics; 2018. Available from: http://aclweb.org/
anthology/P18-4013.

[2] European Environment Agency (EEA). Chemicals for a sustainable future
Copenhagen, 17 May 2017 Report of the EEA Scientific Committee Semi-
nar.; May 2017. URL: https://www.eea.europa.eu/about-us/governance/
scientific-committee/reports/chemicals-for-a-sustainable-future.

[3] Mayr A, Klambauer G, Unterthiner T, Hochreiter S. DeepTox: Toxicity Pre-
diction using Deep Learning. Front Environ Sci. 2016;80. Available from:
10.3389/fenvs.2015.00080.

[4] European Chemicals Agency (ECHA). The use of alterna-
tives to testing on animals for the REACH regulation; 2021.
https://op.europa.eu/en/publication-detail/-/publication/
c53fbd08-7fbc-11eb-9ac9-01aa75ed71a1/language-en#.

[5] European Chemicals Agency (ECHA). REACH Information requirements;
2022. [Online; accessed 09-October-2022]. https://echa.europa.eu/
regulations/reach/registration/information-requirements.

[6] Scholz, S and Sela, E and Blaha, L and Braunbeck, T and Galay-Burgos, M
and García-Franco, M et al . A European perspective on alternatives to ani-
mal testing for environmental hazard identification and risk assessment; 2022.
https://doi.org/10.1016/j.yrtph.2013.10.003.

[7] Cherkasov, A and Muratov, E N and Fourches, D and Varnek, A and Baskin,
I I and Cronin, M et al . QSAR Modeling: Where Have You Been? Where Are
You Going To?; 2014. https://doi.org/10.1021/jm4004285.

[8] David L, Thakkar A, Mercado R, Engkvist O. Molecular representations in AI-
driven drug discovery: a review and practical guide. J Cheminform. 2020;39.

[9] Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez Aea. At-
tention is all you need. Advances in neural information processing systems.
2017. Available from: https://www.sciencedirect.com/science/article/
pii/S0169743997000610.

[10] Laws, EA. Environmental Toxicology.[electronic resource]: Selected
Entries from the Encyclopedia of Sustainability Science and Tech-
nology. Springer New York. 2013:1-15. Available from: https:
//search.ebscohost.com/login.aspx?direct=true&db=cat07472a&AN=
clec.SPRINGERLINK9781461457640&site=eds-live&scope=site.

[11] Fisher, MR. Environmental Biology: 6.3v Environmental Toxicol-
ogy. Open Oregon Educational Resources. Available from: https:
//search.ebscohost.com/login.aspx?direct=true&db=cat07472a&AN=
clec.SPRINGERLINK9781461457640&site=eds-live&scope=site.

[12] Agency SC. Hazard and risk assessment of chemi-
cals - an introduction; 2020. https://www.kemi.se/
download/18.32f4eb311753c0a67fe1cf6/1604653630900/
Guidance-Hazard-and-risk-assessment-an-introduction.pdf.

[13] European Chemicals Agency (ECHA). Guidance on registra-
tion; 2021. [Online; accessed 11-April-2023]. https://echa.
europa.eu/documents/10162/2324906/registration_en.pdf/
de54853d-e19e-4528-9b34-8680944372f2?t=1629205524601.

[14] Jain AK, Mao J, Mohiuddin KM. Artificial neural networks: A tutorial. Com-
puter; 1996.

[15] Schmidhuber J. Deep learning in neural networks: An overview. Neural Net-
works. 2015;61:85-117. Available from: https://www.sciencedirect.com/
science/article/pii/S0893608014002135.

[16] Svozil D, Kvasnicka V, Pospichal J. Introduction to multi-layer feedforward
neural networks. Chemometrics and Intelligent Laboratory Systems. 1997;39.
Available from: https://www.sciencedirect.com/science/article/pii/
S0169743997000610.

[17] Käll S. Predicting Chemical Ecotoxicity using Artificial Intelligence. 2022.

[18] He C. Transformer in CV; December 2021.
https://towardsdatascience.com/transformer-in-cv-bbdb58bf335e.

[19] Sequence Modeling with Neural Networks (Part 2): Attention Models; 2016.
https://indicodata.ai/blog/sequence-modeling-neural-networks-part2-attention-models/.

[20] Alammar J. The Illustrated Transformer ; 2020. http://jalammar.github.
io/illustrated-transformer/.

[21] Devil J, Chang W M, Lee KT. Google, and A. I Language. BERT:
Pre-training of Deep Bidirectional Transformers for Language Understand-
ing.. Tech. rep.;. [Online; accessed 16-April-2023]. https://github.com/
tensorflow/tensor2tensor.

[22] Devlin J, Chang M, Lee K, Toutanova K. BERT: Pre-training of Deep
Bidirectional Transformers for Language Understanding. Tech rep. 2019. Available
from: https://doi.org/10.48550/arXiv.1810.04805.

[23] Muller B. BERT 101 State of The Art NLP Model Explained; 2022. [Online;
accessed 16-April-2023]. https://huggingface.co/blog/bert-101.

[24] Liu Y, Ott M, Goyal N, Joshi M, Chen D, Levy Oea. RoBERTa: A Ro-
bustly Optimized BERT Pretraining Approach. Advances in neural information
processing systems. July 2019. Available from: https://doi.org/10.48550/
arXiv.1907.11692.

[25] Chithrananda S, Grand G, Ramsundar B. ChemBERTa: Large-Scale
Self-Supervised Pretraining for Molecular Property Prediction. arXiv preprint
arXiv:2010.09885. 2020.

[26] Wolf T, Debut L, Sanh V, Chaumond J, Delangue C, Moi Aea. Transformers:
State-of-the-art natural language processing. Proceedings of the 2020 conference
on empirical methods in natural language processing: system demonstrations.
2020:38-45.

[27] Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan Gea. Pytorch: An
imperative style, high-performance deep learning library.. Advances in neural
information processing systems 32; 2019.

[28] Paul, S. Bayesian Hyperparameter Optimization - A Primer ; 2020.
[Online; accessed 11-April-2023]. https://wandb.ai/site/articles/
bayesian-hyperparameter-optimization-a-primer.

[29] Sequence Modeling with Neural Networks (Part 2): Attention Mod-
els;. https://paperswithcode.com/method/linear-warmup#:~:
text=Linear%20Warmup%20is%20a%20learning,the%20early%20stages%
20of%20training.

[30] Gaoxia, J and Wenjian, W. Error estimation based on variance analysis of
k-fold cross-validation. Elsevier LTD. 2017. Available from: http://dx.doi.
org/10.1016/j.patcog.2017.03.025.

[31] Muller B. BERT 101 State of The Art NLP Model Explained; 2022. [Online;
accessed 16-April-2023]. https://pubchem.ncbi.nlm.nih.gov/.

A

Appendix 1

A.1 PCA for Carcinogenicity and Reproductive
Toxicity Data Sets

In this appendix, the principal component analysis plots for the CLS tokens from
one fold of the carcinogenicity and the reproductive toxicity data sets can be found,
see Figure A.1. In the figure, the points have been coloured according to the
corresponding median measured concentration for the SMILES/CLS token, with
lighter yellow corresponding to higher measured concentrations and blue/black to
lower. Moreover, for both the carcinogenicity and the reproductive toxicity data
set, the leftmost plot corresponds to the PCA of the single-DNN model, and the
rightmost to the PCA of the multiple-DNN model.

a) PCA of the CLS tokens from one fold in the carcinogenicity data set.

b) PCA of the CLS tokens from one fold in the reproductive toxicity data set.

Figure A.1: Principal component analysis of CLS tokens from one fold in 10-fold
cross-validation of a) the carcinogenicity data set, and b) the reproductive
toxicity data set, coloured by corresponding median concentration for each CLS token,
with the left plot corresponding to the single- and the right to the multiple-DNN
model in each case.
