An NLP approach to assess information security policies
Application of GPT-3 within a policy domain

Master's thesis in Computer science and engineering

Hampus Lundblad
Pouya Faramarzi

Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
Gothenburg, Sweden 2022

© Hampus Lundblad 2022.
© Pouya Faramarzi 2022.

Supervisor: Miroslaw Staron, Interaction Design and Software Engineering
Advisor: Daniel Dalevi, Centiro
Examiner: Lucas Gren, Interaction Design and Software Engineering

Master's Thesis 2022
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Typeset in LaTeX

Abstract

Threats to companies' information security are ever-increasing, and to adequately protect a company's information assets, a proper information security policy needs to be established. For this purpose, information security standards such as ISO 27001:2013, created by the International Organization for Standardization, exist. However, for a policy to be complete towards ISO 27001:2013, the policy must fulfill up to 114 different requirements, also called controls.
Experts within information security policies often do this work, which can be time-consuming and error-prone. This study therefore aimed to use natural language machine learning models to classify whether a text extract from a given information security policy is complete towards a specified control or not. Ultimately, the study investigates whether language models are a good fit for software engineering topics that are also business-critical.

The study utilized the design science methodology. A framework for determining policy completeness was constructed, and different natural language machine learning classifiers were evaluated. The main focus was on the large-scale pre-trained model GPT-3 by OpenAI. Three different datasets were constructed to train the models, each consisting of annotated text extracts from information security policies. These were labeled as either being ISO certified or not, depending on whether the company, or the policy itself, mentioned an ISO certification. The models were then evaluated on these three datasets, with F1-score and accuracy as evaluation metrics. Lastly, a validation session with a policy expert from a case company that specializes in software solutions and policy compliance was conducted to determine how GPT-3's evaluation of policies compares to that of an expert.

The results showed that GPT-3, and the pre-trained word embedding model GloVe with SVC as a classifier, performed better in policy classification than the other machine learning models. However, when compared to an expert, GPT-3 fails to distinguish between policies that are not complete towards ISO and policies that are partially complete towards ISO, something which the policy expert was able to do. We conclude that GPT-3 has the potential to perform well in the domain of information security policy.
However, due to a lack of data and expertise in the domain of information security policies, the results from the validation session do not reflect this. Hence, the authors provide a discussion regarding this and recommendations for future work.

Keywords: software engineering, information security policy, ISO, NLP, OpenAI, GPT-3, machine learning

Acknowledgements

We would like to express our gratitude to our supervisor from Chalmers, Miroslaw Staron, who provided valuable and relevant feedback, helped us with the direction of the thesis, and provided support throughout the entire project. We would also like to thank Centiro for offering us the opportunity to pursue our thesis together with them. A special thank you to Daniel Dalevi, Mikael Böörs, Gustaf Stawåsen, and Thomas Herkel from Centiro for their expertise and support. Their feedback and perspective have helped us immensely. Lastly, we would like to thank our examiner Lucas Gren, who read and gave us feedback on our thesis and how to complete it.

Hampus Lundblad, Gothenburg, June 2022
Pouya Faramarzi, Gothenburg, June 2022

Contents

List of Figures
List of Tables
1 Introduction
  1.1 Practical scenario
  1.2 Research questions
  1.3 Delimitation
  1.4 Report structure
2 Related Work
  2.1 Text classification in Natural Language Processing
  2.2 Policy classification
  2.3 GPT-3 & BERT
3 Background
  3.1 Domain background
  3.2 ISO 27001:2013
  3.3 Natural language processing
    3.3.1 Text normalization
    3.3.2 Text vectorization
      3.3.2.1 Frequency based representations
      3.3.2.2 Sequence based representations
      3.3.2.3 Contextual based representations
    3.3.3 Pre-trained models
      3.3.3.1 Transformers
      3.3.3.2 GPT-3 by OpenAI
      3.3.3.3 Using GPT-3
      3.3.3.4 Using GPT-3 as a classifier
      3.3.3.5 BERT by Google
  3.4 Support Vector Machine (SVM)
4 Research Design
  4.1 Problem Investigation
  4.2 Treatment Design
    4.2.1 Data understanding
    4.2.2 Data collection and overview
      4.2.2.1 Data annotation
    4.2.3 Modeling
      4.2.3.1 Control selection
      4.2.3.2 Data preparation and models
    4.2.4 Using GPT-3
      4.2.4.1 Uploading data
      4.2.4.2 Few-shot mode
      4.2.4.3 Fine-tuning mode
  4.3 Treatment Validation
    4.3.1 Metrics and tools
      4.3.1.1 Precision, Recall and F1-score
      4.3.1.2 K-fold cross-validation
    4.3.2 Model comparisons
    4.3.3 Expert validation with the case company
    4.3.4 Domain result comparison
  4.4 Weekly meetings with case company
5 Results
  5.1 Model comparisons
    5.1.1 GPT-3
      5.1.1.1 Few-shot
      5.1.1.2 Fine-tuning mode
    5.1.2 Comparison with benchmark models
  5.2 Expert validation with the case company
  5.3 Domain result comparison
6 Discussion
  6.1 Framework evaluation
  6.2 GPT-3 and information security policy completeness
    6.2.1 GPT-3 compared to benchmark models
    6.2.2 GPT-3 compared to expert results
  6.3 GPT-3 compared to other domains
  6.4 Threats to validity
    6.4.1 External Validity
    6.4.2 Internal Validity
  6.5 Future work
7 Conclusion
Bibliography
A Appendix 1
  A.1 Search words
  A.2 E-mail template

List of Figures

1.1 Activity diagram of a practical scenario.
1.2 Activity diagrams of a control mapping (left) and a completeness check (right). As an example, the control of "Policy on the use of cryptographic controls" from ISO 27001:2013 was used and its implementation guidance provided in ISO 27002:2013 was referenced when determining examples of characteristics for completeness.
3.1 ISO 27001:2013 extract of the domain Human resource security. The table is an extract from table A.1 in Annex A of SS-EN ISO/IEC 27001:2017 and is reproduced with due permission from SIS, the Swedish Institute for Standards, who holds the copyright and also sells the complete standard www.sis.se.
3.2 ISO 27002:2013 extract of the implementation guidance for the control A.7.3.1 defined in ISO 27001. The text is taken from SS-EN ISO/IEC 27002:2017 and is reproduced with due permission from SIS, the Swedish Institute for Standards, who holds the copyright and also sells the complete standard www.sis.se.
3.3 Text normalization example of processing a sentence with each of the following NLP tasks: tokenization, case-folding, stop-word removal, and lemmatization.
3.4 Text vectorization example using a frequency based approach.
3.5 A graph displaying different NLP models along with the number of parameters for each model [1], [2], [3], [4], [5].
3.6 A simplified overview of the transformer's architecture.
3.7 The GPT-3 playground where the user can input tasks. In this case GPT-3 was asked to validate if a given code line is valid in the programming language Python. The green text is GPT-3's response.
3.8 The playground prompt where GPT-3 is used as a classifier in its text completion mode. The green text is GPT-3's response.
3.9 The playground prompt where GPT-3 is used as a classifier in its text completion mode. The green text is GPT-3's response. Here GPT-3 fails to classify encryption as a part of the ISO 27001:2013 standard.
3.10 The four step procedure taken by GPT-3 when used in its few-shot setting. Source of the image is https://beta.openai.com/docs/guides/classifications
3.11 Response from GPT-3 when sending the query "Is Cryptography part of ISO 27001:2013?", done using the Python library OpenAI. GPT-3's answer can be seen at line 3, where it labels the input as Yes. The selected_examples column returns how useful the uploaded examples were in classifying the input; a higher score means more useful.
3.12 The SVM classifier using a boundary line to separate and classify the datapoints.
4.1 The engineering and design cycle as defined by Wieringa.
4.2 The full list of contacted companies. In the Name column, the contacted company's name is listed. In the Response column, the results from the exchanges are listed. The cells were marked green if a response was received where they disclosed that they would, or would not, share their policy, or if they would redirect us. From two companies information security policies were received; therefore their names were redacted to [Company].
4.3 A bar-plot of the dataset containing ISO and non-ISO data points.
4.4 The template used for data annotations including an example for the sake of illustration.
4.5 Framework architecture for information security policy control classification in relation to ISO.
4.6 The overall machine learning pipeline.
From documents, into the machine learning pipeline, which consists of an NLP pipeline and a machine learning classifier, and yields an output.
4.7 Detailed view of the machine learning pipeline in Figure 4.6 including sub-tasks and models used.
4.8 The template used for the expert validation session with an example for the sake of illustration.
5.1 The results of using the models on the A512 dataset. GPT-3 in its few-shot setting, with Ada as search model and Curie as the classification model, is the best performing, achieving an accuracy of 0.7 and an F1-score of 0.727.
5.2 Results from all models evaluated on the dataset A722. The purple dotted line shows the ZeroR baseline.
5.3 Results from all models evaluated on the dataset A923. For the model Word2Vec an F1-score was unable to be calculated. The purple dotted line shows the ZeroR baseline.
5.4 Scatter plots for each dataset, where the yellow dots are the expert's answers to how complete a policy text extract was towards a given ISO control. The red and blue dots represent GPT-3's probability towards the text extract being related to an ISO certified policy. The X-axis represents which text was used. The Y-axis represents the answers towards how complete the text was towards ISO 27001:2013.
5.5 Residual plots for each dataset, where the error is calculated by using the expert's answers as the true value. If the dots are closer to the 0.0 line, then they are more aligned with the expert's opinion. The Y-axis shows the error, and the X-axis shows for which text the error was calculated.

List of Tables

3.1 The data used in the example.
4.1 Table of selected controls. The table is an extract from table A.1 in Annex A of SS-EN ISO/IEC 27001:2017 and is reproduced with due permission from SIS, the Swedish Institute for Standards, who holds the copyright and also sells the complete standard www.sis.se.
4.2 The hyperparameters of GPT-3, in its few-shot setting, when classifying information security policies. In this table, logprobs = 2 indicates that the model returned the logarithmic probability values.
4.3 Confusion matrix, where TP = True Positives, FN = False Negatives, FP = False Positives, TN = True Negatives.
4.4 Table of benchmark models with their word embedding types and models as well as the combined classifier.
5.1 Results of applying GPT-3 few-shot to the three different datasets. Bold text is used to show the highest value for each category. For values of K > 1 the average score was calculated across the runs.
5.2 F1-score and accuracy score for the fine-tuned GPT-3 model. The scores in bold are the best performing for that model and dataset.
5.3 Features found with an increasing significance level for each control.
5.4 Expert ranked features based on most characterizing for each control using a Likert scale.
5.5 The scores from GPT-3 applied on different datasets. BioText, MedSTS and PubMedCRT are from the study by Moradi et al. [6]. ADE and NIS are from the study by Alex et al. [7]. The datasets A512, A722, and A923 are taken from the results of this study, where the highest F1-score was chosen for each dataset.

1 Introduction

Compliance with standards from the International Organization for Standardization (ISO) regarding information security policies requires an organization to have an information security policy.
Still, it is not easy to create a good one. An information security policy alone is far from sufficient to provide adequate safety measures for an organization, yet, owing to the increased significance of information technology, it has received considerable attention. Additionally, protecting and securing organizations' information assets has become an increasingly challenging task as the complexity of security threats has grown [8]. The fact that large and high-profile companies such as Twitch (a game-streaming platform) have been subjected to a data leak [9], and Kaseya (an IT company that provides software to, among others, COOP) has been subjected to a ransomware attack [10], within the last twelve months signals that no organization is truly safe. Thus, on behalf of the stakeholders, there is now pressure and demand that organizations accept their responsibility in terms of offering adequate information security measures [11].

Well-known standards, such as COBIT or the ones defined by ISO [11], provide guidelines and frameworks with various objectives that are used towards implementing a robust information security policy that reflects the needs and risks of an organization [12]. However, it has been suggested that the guidelines provided by the standards are too generic, and organizations find it challenging to assess the completeness of their information security policy with respect to them, which ultimately presents a key obstacle towards achieving certification [13] [14]. As a result, organizations find themselves in a time-consuming and expensive certification process [15].

A considerable amount of research has been conducted into assessing the completeness of privacy policies, mainly by utilizing machine learning methods [16] [17]. However, little research has been done within the domain of information security policies.
Therefore, it is not yet clear whether machine learning methods can be adapted to software engineering domains that are also business-critical, in particular, to assess the completeness of information security policies. Hence, additional studies on using machine learning methods within the domain of information security and its policies are needed.

This study aims to alleviate the aforementioned completeness issue by investigating to what extent Natural Language Processing (NLP) models can help with determining information security policy completeness in relation to the ISO 27001:2013 standard, with the intent to generalize the findings by offering a framework for completeness checking. Because publicly available information security policies are scarce, pre-trained language models, such as GPT-3, are used to maximize model learning rates with a small sample size. Additionally, this thesis aims to provide a framework that can be used without expert analysis to adjust an organization's information security policy and fill the gap between the guidelines and the individual characteristics of an organization by using accessible and inexpensive methods.

Furthermore, there also exists a gap between defining a policy and applying it in practice. For example, it is possible for any organization to have an adequate information security policy in relation to ISO, but difficult to establish, by reviewing the policy alone, whether the organization actually implements the policy in practice. Hence, the study deals with "completeness", i.e., the presence of critical elements, rather than "compliance", in relation to ISO.

In order to investigate the practical feasibility of the study as a real-world application and gain access to the domain knowledge of experts, a collaboration was established with Centiro.
Centiro, henceforth referred to as this thesis's case company, is an organization specializing in software solutions and policy compliance, located in Borås, Sweden.

1.1 Practical scenario

The starting point for any organization to achieve ISO certification is to first define an ISO 27001 scope statement. The scope statement sets the boundaries on what processes, products, or departments an information security management system should cover within the organization. More importantly, the scope also allows for choosing which aspects of the ISO 27001 standard need to be implemented in order to be granted an ISO certification. These aspects are more commonly referred to as controls, and ISO 27001:2013 is made up of 114 of them. Hence, an organization only needs to comply with the controls that are deemed necessary based on its set scope. In other words, not all 114 controls are required in an information security policy for an organization to achieve ISO certification [18]. Thus, the first step of completeness checking is to determine the controls to check against and map which part of the policy corresponds to which control. The second step in the process is to assess the completeness of the controls by confirming the presence of key elements crucial to the controls.

Experts in the field commonly perform the second step, and the key elements are defined by the characteristics of a control. In a document complementing ISO 27001, referred to as ISO 27002, guidance for control implementation is provided for each control and can be used as grounds to identify the key elements [19]. Observe the activity diagrams in Figures 1.1 and 1.2 for illustrations of the process overview and an example of mapping and completeness checking of a control.
In the diagram given by Figure 1.1, a circle symbolizes the start and endpoint, a rectangle represents a process, and a rectangle (or rectangles) with a wavy bottom edge represents a document that serves as an input or output used by the processes. Meanwhile, in the activity diagrams given by Figure 1.2, a circle symbolizes the start and endpoint, a diamond coupled with a question represents a decision, a diamond without a question represents a merge, a green rectangle indicates a resulting successful action, and a red rectangle indicates a resulting unsuccessful action.

Figure 1.1: Activity diagram of a practical scenario.

Figure 1.2: Activity diagrams of a control mapping (left) and a completeness check (right). As an example, the control of "Policy on the use of cryptographic controls" from ISO 27001:2013 was used and its implementation guidance provided in ISO 27002:2013 was referenced when determining examples of characteristics for completeness.

In Figure 1.1, the practical scenario is given from start to finish in an activity diagram on a holistic level. It is divided into one input, an information security policy, leading to two processes and two outputs, one for each step. The output from the control mapping step is a collection of extracts representing the result of annotating the information security policy by controls. The left activity diagram in Figure 1.2 is an example of such a mapping, where a scope statement defines the need for the control [18]. Once the mapping has been established, the second step in the process is to check the extracted control's completeness by confirming the presence of key factors. An example of this is given in the right activity diagram in Figure 1.2. The output from the completeness checking is then a completeness assessment where each control is deemed either complete or incomplete.
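The two-step procedure above can be sketched in code. The following is a minimal, hypothetical Python sketch: the control ID, its keyword list, and the example policy extracts are invented for illustration and are not taken from ISO 27001:2013 or ISO 27002; a real implementation would use expert-derived key elements rather than keyword matching.

```python
# Hypothetical key elements for one in-scope control (invented for
# illustration; real key elements come from ISO 27002 guidance).
IN_SCOPE_CONTROLS = {
    "A.10.1.1": ["cryptographic", "encryption", "key management"],
}

def map_controls(extracts, controls):
    """Step 1 (control mapping): pair each policy text extract with
    the controls whose key elements it mentions."""
    mapping = {}
    for cid, keywords in controls.items():
        mapping[cid] = [e for e in extracts
                        if any(k in e.lower() for k in keywords)]
    return mapping

def check_completeness(mapping, controls):
    """Step 2 (completeness check): a control is 'complete' only if
    every key element appears somewhere in its mapped extracts."""
    verdicts = {}
    for cid, extracts in mapping.items():
        text = " ".join(extracts).lower()
        verdicts[cid] = all(k in text for k in controls[cid])
    return verdicts

# Invented policy extracts for the example run.
policy = ["Encryption with approved key management is mandatory.",
          "Cryptographic controls follow the corporate standard."]
mapping = map_controls(policy, IN_SCOPE_CONTROLS)
print(check_completeness(mapping, IN_SCOPE_CONTROLS))  # {'A.10.1.1': True}
```

The NLP models studied in this thesis replace the naive keyword test in step 2 with a learned classifier; the surrounding two-step structure stays the same.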
Of course, for an expert to manually annotate an information security policy that may have up to 114 controls present, and to check the completeness of each one, is not only time-consuming and labor-intensive but also error-prone. Furthermore, a non-expert who lacks the knowledge of what is required to be complete in relation to ISO for each control, for example, what a strong cryptographic algorithm is, would struggle to define the lowest levels of the information security policy. Hence, supplying both experts and organizations with an automated solution is desirable.

An automated solution could also be of interest to software engineers, not only because it can dictate what software tools and methods of working are available to them, but also because it promotes the practice of information-security-aware development. Therefore, product owners who might lack legal expertise and legal experts who may lack knowledge of the software domain should work in close collaboration to define a consensus on various elements for products and processes, which could in turn yield requirements for, for example, reliability and security.

1.2 Research questions

The following research questions are divided into two major themes, where one is aimed towards establishing a quality framework for evaluating information security policies in order to identify missing ISO compliance factors (RQ1), while the second focuses on data analysis, model analysis, model evaluation, and comparison (RQ2). Ultimately, the latter research question aims to support the first with a language model.

RQ1: What characterizes a good quality machine learning framework based on factors such as the amount of data and manual labor needed for information security policy?

RQ2: To which degree does a GPT-3 language model determine the ISO completeness of various organizations' information security documents?
RQ2.1: Which machine learning model features are beneficial for determining document coverage and alignment in relation to ISO?

RQ2.2: To what degree can the features of GPT-3 enhance the document classification process, and how does it perform versus other algorithms?

RQ2.3: How does the GPT-3 model trained and evaluated on information security policy documents compare to performances of GPT-3 models applied in other domains?

Addressing RQ1 leads to the design of a framework, which lays the foundation for the document classification in RQ2. RQ2, a broader question, is further broken down into three sub-research questions. RQ2.1 allows for experimentation and analysis of different machine learning algorithms (and their various properties) to establish an optimal benchmark against which future results can be compared. Answering RQ2.2 leads to comparisons between the GPT-3 model and the previously established benchmark, while also investigating which aspects of GPT-3 aid (or hinder) its classification ability versus standard algorithms. Finally, addressing RQ2.3 leads to comparisons of the achieved performance results of GPT-3 with applications within other domains, to deem whether information security policy completeness is a reasonable domain of application for GPT-3.

1.3 Delimitation

This thesis uses existing information security policies and already implemented natural language processing models and does not define nor create any new ones. Furthermore, the time restrictions of this thesis limit the size of the dataset of gathered information security policy documents; with more time, a more refined and versatile dataset could be created. Furthermore, this thesis does not deal with compliance but rather completeness, as the former requires more attention and involvement of ISO-compliance experts.
Finally, this thesis does not attempt a real-world application and evaluation of models but focuses on designing a possible solution to the problem context.

1.4 Report structure

The remainder of this paper is divided into six chapters, where Chapter 2 covers the existing research and Chapter 3 provides the theory behind the study. Chapter 4 explains the execution of the study, while Chapter 5 presents the results. Afterwards, Chapter 6 provides a discussion in relation to the results, threats to validity, and suggestions for future work. Finally, Chapter 7 provides explicit answers to the research questions together with final remarks.

2 Related Work

The area of text classification can roughly be divided into two categories [20]: rule-based methods and machine learning (ML) based methods. Rule-based methods require the researchers to have deep domain knowledge and use pre-defined rules to classify texts. In contrast, ML methods require models and pre-labeled data to learn the relations between the texts and their corresponding labels.

In this section, the area of machine learning methods in natural language processing is further studied to understand what methods researchers are currently using to achieve state-of-the-art performance in terms of preprocessing steps, models, resampling procedures, and evaluation metrics. Furthermore, a closer look at similar studies involving the classification of policies is provided, to better estimate what the field looks like and what approaches different researchers have taken to tackle the problem of policy classification. Lastly, it is vital to understand how large-scale models such as GPT-3 and BERT are used in current research, their limitations, and the steps needed to achieve good performance.
2.1 Text classification in Natural Language Processing

To better understand which ML models and preprocessing steps are commonly used in research concerning natural language processing, two main studies have been examined. First, Rahman et al. [21] used standard classification models, such as Support-Vector Machine (SVM) and Random Forest, to classify sentiments of tweets in two different datasets. As a pre-sampling procedure, the researchers used K-fold cross-validation, with k = 4, to divide the data into training and validation sets. The best performing model (MaxEnt) achieved an average F1-score of 76%. The F1-score is the harmonic mean of precision and recall, and thus measures how well a model performs while taking misclassifications into account. The authors of [22] used ML models to classify emails as either phishing emails or not. The models used were SVM, Naïve Bayes, Decision Tree, Long Short-Term Memory (LSTM), and Convolutional Neural Networks; the model with the best average accuracy was the one leveraging Convolutional Neural Networks. Furthermore, similar studies have been done by Miao et al. [23], who used ML models to classify Chinese newspapers. The researchers used several different models, but the conclusion was that a Support Vector Machine (SVM) with a TF-IDF vectorizer yielded the best F1-score of 95.7%. Dadgar et al. [24] conducted a similar study; however, they analyzed English newspapers instead. The evaluation was performed on two datasets, one from BBC, which contained five categories, and one from 20NewsGroup, which contained 20 categories. Their best-performing model was SVM with a TF-IDF vectorizer, which achieved an F1-score of 95.67% across the BBC and 20NewsGroup data. Furthermore, a similar study was made by Tzimourtas et al. [25], where SVM, Random Forest, and Naive Bayes were compared on the 20NewsGroup dataset. The best scoring model was SVM, with an accuracy of 95%.
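The setup that recurs throughout these studies, a TF-IDF vectorizer feeding an SVM, evaluated with k-fold cross-validation on the F1-score, can be illustrated with a short scikit-learn sketch. The toy corpus and labels below are invented for illustration and are not data from any of the cited papers.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

# Invented toy corpus: 1 = positive sentiment, 0 = negative.
texts = ["great product, loved it", "terrible support, very slow",
         "excellent quality, works well", "awful experience, would not buy",
         "really happy with this", "quite disappointing overall",
         "works wonderfully every time", "broke instantly, very poor"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# TF-IDF features feeding a linear SVM, as in the studies above.
model = make_pipeline(TfidfVectorizer(), LinearSVC())

# k = 4 folds (cf. Rahman et al.); F1-score computed per fold.
scores = cross_val_score(model, texts, labels, cv=4, scoring="f1")
print(round(scores.mean(), 3))
```

With only eight documents the cross-validated score is meaningless as a benchmark; the point is the pipeline shape, which scales unchanged to a real annotated corpus.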
Even though SVM is a well-performing classifier, other models can also be used for text classification. Kim [26] investigated if simple Convolutional Neural Networks (CNN) with a small number of hyperparameters could perform well on text classification tasks. The researcher tested four models on seven different datasets and managed to achieve better performance compared to other studies made at the time. Sharma and Moh [27] conducted a study that used classifiers such as SVM, Naive Bayes, and a dictionary-based classifier to predict the outcome of the Indian election by determining the sentiment of tweets related to the election. The best-performing model was SVM, with an accuracy of 78.4%.

2.2 Policy classification

Based on the studies included in this section, the most commonly classified policies were privacy policies. These studies nevertheless give a good overview of which approaches are common when classifying policies. Story et al. [28] conducted a study where privacy policies of mobile applications were analyzed to determine if the privacy policy covered the kind of data that the application was accessing and potentially sharing. They divided the privacy policy into categories, such as Email Address, which indicated whether the app collected the user's email address or not. For example, the text "We collect your email address" would be classified as True by the Email Address classifier, but as False by the GPS Location classifier. Furthermore, each classifier was split into classifying whether a first party or a third party accessed the data. Hence, both a GPS Location 1st Party classifier and a GPS Location 3rd Party classifier needed to be trained. The reason for dividing the classifiers in this manner was that, in privacy policies, data could be accessed by a first party but not a third party. To train their models, Story et al. used an annotated set of documents, also known as an annotated corpus.
Preprocessing steps such as stop-word removal, vectorization, and normalization of sentences were conducted in order to improve the models' performance. Along with vectorization, the authors also used a manually crafted vector of boolean values, which indicated the absence or presence of characteristic words. The model used in the research was SVC, and the result was a mean F1-score of 71% over the 26 objectives that were classified. The conclusions that the authors drew from the results were that compliance issues in mobile privacy policies were common and that their proposed model, along with a mobile application analysis, can improve privacy transparency. Furthermore, [17] conducted a study where the researchers analyzed privacy policies and their completeness towards GDPR compliance. The analysis identified the absence or presence of metadata types in a text. For example, one metadata type was processing purposes, which concerned the "purposes of the processing for which personal data is being collected." The researchers used three different approaches to classify which metadata types were present in a policy. These approaches were a machine learning approach using Support Vector Machines (SVM), a similarity-based classification using cosine similarity of sentence embeddings, and a keyword-based classification method that compared sentences to keywords related to a specific metadata type. The result was that their completeness-checking model, which used machine learning, had an F1-score of 91.47%. Compared to a keyword-based approach, this method improved the F1-score by 32.35%. Narksenee and Sripanidkulchai [29] conducted a similar study, using machine learning models to determine if an application's behavior complied with the application's privacy policy. Thotawaththa et al. [30] investigated how machine learning models such as BERT could be used to classify privacy policies. Additionally, Alabduljabbar et al.
[31] conducted a study with the goal of reducing the read-time of privacy policies from a user perspective. The researchers utilized machine learning and deep learning models to classify the content of the policies and reduce the number of paragraphs that the users needed to read. The models utilized preprocessing steps such as stop-word removal, lemmatization, stemming, TF-IDF, Doc2Vec, Universal Sentence Encoder, and WordPiece. An ensemble of six machine learning and deep learning models was used to classify the privacy policies. The result was an F1-score of 91% on their validation dataset, and after a user study, they concluded that the read-time was reduced by 39.14%. Liang and Ye [32] conducted a study that aimed to create a classification process using three-way decisions [33] for inclusive policies. Their proposed model used a two-stage process. First, an ensemble is trained that outputs a category together with a confidence value (probability score). The value is then compared against a threshold, and if the probability score is lower than the threshold, the same data is passed to a traditional machine learning model. With this setup, the researchers managed to achieve greater performance with their three-way decision model compared to ten other baseline models. The best-performing model used AdaCNN for the first stage and SVM for the second stage. These studies give insight into how previous work on policies has been conducted. All the studies divided the policies into categories and then created a classifier for each category. However, the researchers utilized different approaches. For example, Story et al. [28] used a dataset that was annotated by domain experts and used the categories in the dataset. Thotawaththa et al. [30] used a combination of domain expert insight and user perspective to choose categories.

2.3 GPT-3 & BERT

In this section, studies related to GPT-3 and BERT are presented; these two models are also further explained in Chapter 3.
GPT-3, which is a pre-trained deep learning model by OpenAI, has been used for text classification purposes, such as classifying emails [34], detecting hate speech, and classifying racist or sexist texts [35]. However, GPT-3 has its limitations. One study by Moradi et al. [6] investigated if GPT-3 could perform well on text classification tasks in the biomedical domain. The conclusion was that the model could not achieve state-of-the-art performance on the chosen NLP tasks when trained on just a few examples. BERT, which stands for Bidirectional Encoder Representations from Transformers, is another pre-trained machine learning model that has also been used in several studies. To understand how BERT can improve performance in binary classification tasks, Zhang and Zhang [36] conducted a study where BERT was used as an embedding layer for a downstream ML model. The researchers evaluated this model against benchmarks on the IMDB dataset. The result was that the model that used BERT as an embedding layer had an F1-score of 93.11%, an improvement of 2.01% compared to the best-performing baseline model.

3 Background

This section introduces the domain background, i.e., an overview of the information security policy and the ISO 27001:2013 standard. Afterward, the necessary background pertaining to the technical approach is summarized.

3.1 Domain background

Information security policies represent an organization's ability to safeguard information assets proactively. In other words, they are meant to exist as documentation of an organization's approach to managing information security. They have, therefore, also become acknowledged as an organization's most crucial information security mechanism [37]. Information security alone is about providing "...
protection of information and information systems from unauthorized access, use, disclosure, disruption, modification, or destruction in order to provide confidentiality, integrity, and availability" [38]. However, researchers argue that technical implementations alone are no longer adequate for protecting an organization's information assets and need to include more factors, such as the management and employees [8]. Thus, the information security policy, which provides "... directives, regulations, rules, and practices that prescribe how an organization manages, protects and distributes information" [39], has also become recognized as a crucial business document of any organization [8].

3.2 ISO 27001:2013

Implementing an information security policy alone is not sufficient to safeguard the organization's information assets. Although no perfect security and protection plan exists to this date, proper framework and technique implementation in the shape of various security standards helps minimize the risk of harmful exploitation and establishes the best practices for information security management within an organization [12][8]. ISO 27001 is an example of such a security standard for information management systems, and it provides rules and guidelines for organizations to follow in order to decrease the risk of information and information systems being exposed [11]. Increasing an organization's compliance with such standards, therefore, assists with establishing a robust Information Security Management System (ISMS) [8]. The standard, in its entirety, specifies 114 controls divided into 35 control objectives, which are further divided into 14 domains. A control is a type of safeguard; a control objective is a statement that defines the result of implementing said control or controls; and a domain is a grouping of control objectives that belong to a specific theme [11][40].
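To make the domain–objective–control hierarchy concrete, the sketch below models it as a nested Python structure, using the Human resource security domain (A7) as example data. The `implemented` set is invented for illustration, and the completeness check is a simplification of real certification logic, which also depends on the organization's chosen scope.

```python
# Hypothetical sketch of the ISO 27001 hierarchy described above:
# a domain groups control objectives, which group controls.
iso_domain = {
    "id": "A.7",
    "name": "Human resource security",
    "objectives": {
        "A.7.1": ["A.7.1.1", "A.7.1.2"],
        "A.7.2": ["A.7.2.1", "A.7.2.2", "A.7.2.3"],
        "A.7.3": ["A.7.3.1"],
    },
}

# Invented example: the set of controls an organization has implemented.
implemented = {"A.7.1.1", "A.7.1.2", "A.7.2.1", "A.7.2.2", "A.7.2.3", "A.7.3.1"}

# A domain is fulfilled once all controls of all its objectives are in place.
controls = [c for ctrls in iso_domain["objectives"].values() for c in ctrls]
domain_compliant = all(c in implemented for c in controls)
print(domain_compliant)  # → True
```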
The need for each control to be implemented is defined by an ISO 27001 scope set by an organization prior to applying for an ISO certification. Hence, only a subset of the controls may be required for ISO compliance, although most commonly all of them are in scope. Observe the extract from ISO 27001 in Figure 3.1 for an example of an entire domain.

Figure 3.1: ISO 27001:2013 extract of the domain Human resource security. The table is an extract from table A.1 in Annex A of SS-EN ISO/IEC 27001:2017 and is reproduced with due permission from SIS, the Swedish Institute for Standards, who holds the copyright and also sells the complete standard www.sis.se.

In Figure 3.1, the domain is defined as Human resource security (A7) and consists of three control objectives: A.7.1, A.7.2, and A.7.3. The control objectives pertain to the different possible statuses of employment. These control objectives then have six controls: A.7.1.1, A.7.1.2, A.7.2.1, A.7.2.2, A.7.2.3, and A.7.3.1. The implementation of these controls is what is required to fulfill the control objectives and ultimately also be compliant with the domain [18]. Furthermore, each control is supplied with implementation guidance in a document complementing ISO 27001, known as ISO 27002:2013. For an example of such guidance, observe the extract from ISO 27002 provided in Figure 3.2.

Figure 3.2: ISO 27002:2013 extract of the implementation guidance for the control A.7.3.1 defined in ISO 27001. The text is taken from SS-EN ISO/IEC 27002:2017 and is reproduced with due permission from SIS, the Swedish Institute for Standards, who holds the copyright and also sells the complete standard www.sis.se.

In Figure 3.2, the implementation guidance provides more information regarding what the implementation of the control related to termination or change of employment responsibilities should cover.
However, the guidance is not tailored to the control requirements of individual organizations, and implementing it may also not be sufficient to pass a certification [19].

3.3 Natural language processing

Natural Language Processing (NLP) is an area within AI concerned with using computational methods to process natural language data in order to enable the construction of machine learning models. More specifically, by utilizing various NLP tools, any set of human language documents can be processed and represented in numerical forms that can be used in conventional data analysis or machine learning techniques [41]. The relevant NLP tools and techniques used in this study are mainly related to text normalization and text vectorization, where text normalization defines a standardized word format and text vectorization maps the elements of a text to a numeric format.

3.3.1 Text normalization

Text normalization consists of a set of tasks pertaining to converting natural language text to a simplified and standard format that enables comparison with other normalized texts by eliminating various redundancies and anomalies through generalization. Among these tasks, a few are commonly found in most normalization processes and are also relevant to this study. These are mainly word tokenization and word format normalizing [42]. Word tokenization is the simplification part of text normalization and consists of segmenting (or tokenizing) words from a running text and adding these to a comprehensive set (or vocabulary). This is also known as a text parsing operation and acts as an entry point toward word format normalizing. Word format normalizing, in turn, comprises tasks that change the segmented words into a standard format defined by a chosen pipeline of tasks [42]. Case-folding, stop-word removal, and lemmatization are a few examples of what could be included in that pipeline. Refer to the list below for a description of each task.
Case-folding: Maps all letters to lower case such that, for example, Policy and policy are represented the same. Due to its simplicity, case-folding has been recognized as a common practice among practitioners, and it is supported by popular NLP libraries, packages, and word lists [43][44]. However, a disadvantage of case-folding is the inherent ambiguity it introduces [42]. For example, GloVe (a method for learning word embeddings) and glove (a clothing item) would be considered the same word.

Stop-word removal: Removes a class of words known as stop-words. Stop-words are words that are frequently present in any text, such as has and a; their presence is therefore trivial in most use cases. The removal can be done by using a predefined list of stop-words or by removing a top percentile of words in the vocabulary set [42].

Lemmatization: Maps all variations of a word to its corresponding root (or lemma) [42]. For example, has, had, have, and having are mapped to their shared lemma have, and recognized to its lemma recognize.

For a visual representation of each task, observe the example given in Figure 3.3.

Figure 3.3: Text normalization example of processing a sentence with each of the following NLP tasks: tokenization, case-folding, stop-word removal, and lemmatization.

In the figure, the first row of the first column demonstrates the tokenization process. In contrast, the second column represents the output of each previously mentioned task with the tokenized text as input. Finally, the second row of the first column provides an example of what the output of the text normalization process could look like if all the tasks were to be used in a pipeline.

3.3.2 Text vectorization

Text vectorization consists of a set of NLP tasks pertaining to mapping words or sentences from a text to a vector within a predefined vector space, also known as a vector space representation. This process is more commonly known as word embedding, or an embedding technique. The embedding may take on different representations depending on the vector space utilized and the corresponding embedding technique used to map to it [42]. Moreover, embedding techniques can be either context-insensitive or context-sensitive [45]. Context-insensitive embedding techniques, such as frequency-based and sequence-based representations, primarily deal with mapping single words to single vectors [45]: a frequency-based representation maps words based on their frequency, while a sequence-based representation focuses on sequences of words [42][46]. Context-sensitive embedding techniques, such as contextual representations, on the other hand, map multiple contexts of the same word to multiple vectors [45]. Although all embeddings use multidimensional vectors and can be coupled with machine learning algorithms, their success in various practical applications and the insights gained from these applications may differ [46]. For a simple example of what a basic frequency-based embedding could look like, observe Figure 3.4.

Figure 3.4: Text vectorization example using a frequency based approach.

In Figure 3.4, the box at the top represents the input to the vectorizer. The middle box is the vocabulary that stems from the input. Finally, the box at the bottom is the resulting vector from processing the input in relation to the vocabulary. This approach simply counts all the occurrences in the document and is therefore considered a frequency-based representation.

3.3.2.1 Frequency based representations

Frequency-based word embedding is the most commonly used vector space representation and uses an approach of counting word occurrences to construct sparse multidimensional numeric vectors.
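This counting approach, in the spirit of Figure 3.4, can be sketched in a few lines of Python; the input sentence below is invented for illustration.

```python
# Minimal sketch of frequency-based (bag-of-words) vectorization.
from collections import Counter

document = "the policy protects the information assets"
tokens = document.split()

# Vocabulary: the ordered set of distinct tokens in the input.
vocabulary = sorted(set(tokens))
counts = Counter(tokens)

# One frequency per vocabulary entry; word order and position are
# ignored, hence "bag of words".
vector = [counts[word] for word in vocabulary]
print(vocabulary)  # ['assets', 'information', 'policy', 'protects', 'the']
print(vector)      # [1, 1, 1, 1, 2]
```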
More specifically, a set of words is mapped to a matrix where each word corresponds to a column, and its frequency is contained in the rows [46]. The resulting frequency-based matrix is characterized as being high-dimensional and sparse. This is due to large vocabulary sizes that directly correspond to a large set of columns (i.e., words), where most of the rows (i.e., word frequencies) are zero, since each document only contains a small subset of the comprehensive vocabulary [46]. Furthermore, an important observation is that a frequency approach considers the order and position of words as irrelevant. Hence, this approach is also known as a bag-of-words approach [47]. An example of a word embedding technique that utilizes the bag-of-words approach is the Term Frequency-Inverse Document Frequency (TF-IDF) vectorizer. The TF-IDF vectorizer uses the product of two terms. The first term is the term frequency, i.e., the frequency of a word in a text, and the second term is the inverse document frequency, i.e., the inverse of the word's presence across all documents. Hence, the values are either zero or a positive real value. While the first term alone is sufficient in certain applications, the second term provides a normalization factor such that words that appear in few documents are given a higher weight [42]. However, using both terms may also give higher weight to errors and misspellings that were not captured during the preprocessing step. Therefore, the choice of whether to use a pure term frequency or TF-IDF can be application- and corpus-specific [46].

3.3.2.2 Sequence based representations

Sequence-based word embedding is built on the notion of a distributional structure suggested by Harris [48], which states that similar words tend to occur in similar contexts. Hence, this embedding class uses an approach of capturing useful syntactic and semantic properties in a given text in order to construct its vectors [42].
In contrast to the frequency-based representation, a sequence-based word embedding investigates the likelihood of a set of words ending up near one another by training a machine learning prediction model. The learned weights are then transferred to the word embedding matrix. As a result, unlike a frequency-based word embedding, which utilizes the entire vocabulary, the sequence-based one registers fewer properties. The resulting matrix from sequence-based word embedding is therefore characterized as dense and short, and it can contain any real value [42]. Two examples of prominent sequential word embedding techniques are Word2Vec and Global Vectors (GloVe). Although both use local context to capture various word semantics, i.e., semantics between words within a defined set size, GloVe takes it one step further by including global context. More specifically, GloVe also attempts to find semantic relationships between words on a corpus level by utilizing global corpus statistics such as word co-occurrence probability ratios [45]. However, both techniques share a shortcoming: poor predictions for words that have not previously been seen, i.e., out-of-sample predictions.

3.3.2.3 Contextual based representations

Contextual-based representations provide representations of words in context. Unlike frequency- and sequence-based representations, which utilize a single vector embedding per word, a contextual-based representation yields an entirely new vector every time a word is encountered in a new context. The vector is then a representation of a specific word type in a specific context. This embedding can be used to compare differences between two words in a context and determine their similarities [42].

3.3.3 Pre-trained models

In recent years, researchers have found that pre-trained NLP models often outperform models that have been built from scratch using task-specific corpora.
More specifically, these are pre-trained models that have learned relevant information from large sets of corpora prior to being applied to a new practical application [49]. This discovery has not only led to an increase in publicly available pre-trained word embedding models [50], but has, together with transformer models and contextual embeddings, also enabled a new functionality referred to as few-shot learning. Few-shot learning models require only a few samples to achieve good performance. Their emergence builds on predecessors that over the years have become more advanced, incorporating ever more parameters and improvements in NLP task performance. They are trained on enormous corpora so that they require little fine-tuning while still achieving state-of-the-art performance. Observe Figure 3.5 for an overview of the size difference between a few pre-trained language models. In the figure, the two models GPT-3 by OpenAI [1] and BERT by Google [5], both of interest to this study, are presented along with their sizes.

Figure 3.5: A graph displaying different NLP models along with the number of parameters for each model [1], [2], [3], [4], [5].

3.3.3.1 Transformers

One of the main enablers of large pre-trained models is the neural network architecture Transformer, as described by Vaswani et al. [51]. The purpose of the Transformer, as described by the authors, was to create an architecture that was less complex and more efficient compared to models utilizing recurrence and convolutions. Therefore, the Transformer utilizes the attention mechanism, which has the advantage of only requiring O(1) sequential operations, while a Recurrent Neural Network requires O(n) operations, where n is the sequence length. Furthermore, it uses a multi-headed self-attention operation over the input context tokens followed by position-wise feed-forward layers to produce an output distribution over the target tokens.
This has been shown to outperform other machine learning models on various tasks such as machine translation and language modeling. The self-attention layer in the Transformer architecture builds on the attention mechanism proposed by Bahdanau et al. [52]. The self-attention layer allows the model to simultaneously attend to different parts of the input sequence. It has been used in several works [53][54][55], but then often in combination with RNNs. Vaswani et al. proposed that RNNs are not needed and that the attention mechanism alone is enough. In the Transformer model, Vaswani et al. used Scaled Dot-Product Attention, which is defined in Equation 3.1. In the equation, Q, K, and V represent a Query, a Key, and a Value, each of which stems from the words of the input sentence, and dk denotes the dimension of the keys. Scaled dot-product attention is almost identical to normal dot-product attention apart from the scaling factor 1/√dk. Vaswani et al. motivated this factor by noting that for large values of dk, the dot product itself grows very large. The attention score in Equation 3.1 is calculated for each word in the input.

Attention(Q, K, V) = softmax(QK^T / √dk) V    (3.1)

A simplified overview of the Transformer architecture can be seen in Figure 3.6. The use of shifting the output, along with the masked multi-head attention layer, ensures that output prediction relies only on inputs that precede the output [51].

Figure 3.6: A simplified overview of the transformer's architecture.

3.3.3.2 GPT-3 by OpenAI

GPT, which stands for Generative Pre-trained Transformer, was first introduced by Radford et al. [56] in 2018. This research aimed to create a model that could achieve strong natural language understanding without the need for large changes when applying the model to different tasks, such as entailment tasks, similarity tasks, question answering tasks, and commonsense reasoning tasks.
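Returning to Equation 3.1, the scaled dot-product attention computation can be sketched in pure Python. The 2×2 matrices Q, K, and V below are invented toy values standing in for the learned projections of a real model.

```python
# Pure-Python sketch of the scaled dot-product attention in Equation 3.1.
import math

def matmul(A, B):
    """Plain matrix multiplication over nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    exps = [math.exp(x - max(row)) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    d_k = len(K[0])                      # dimension of the keys
    K_T = [list(col) for col in zip(*K)]  # transpose of K
    # Q K^T scaled by 1 / sqrt(d_k), as motivated by Vaswani et al.
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, K_T)]
    weights = [softmax(row) for row in scores]  # each row sums to 1
    return matmul(weights, V)            # weighted combination of values

# Invented toy matrices.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out = attention(Q, K, V)
print(out)
```

Each output row is a convex combination of the rows of V, with the weights given by the softmax of the scaled query-key scores.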
GPT uses a multi-layer Transformer decoder model [51] due to its excellent transfer performance on different tasks. The training phase of GPT consisted of two stages. First, it is trained on a large corpus of unlabeled data. Second, the model's parameters are adapted using discriminative fine-tuning. To accomplish the first stage, the researchers use standard language modeling to calculate the likelihood L, which depends on the conditional probability P. The conditional probability is modeled using a neural network whose parameters are trained using Stochastic Gradient Descent. For the second stage, the goal is to maximize the likelihood seen in Equation 3.2,

L2(C) = Σ(x,y) log P(y | x1, ..., xm)    (3.2)

where C is a labeled dataset, x denotes the input tokens, and y the labels. The model was trained on the BookCorpus dataset, containing 7000 unpublished books. Then, to benchmark the model, it was fine-tuned on another set of data depending on the task. This resulted in GPT achieving state-of-the-art performance on 9 out of 12 datasets. However, GPT was only the first iteration of this model. In 2019, Radford et al. [3] released a new study in which they had conducted further research to create a new model, called GPT-2. GPT-2 also uses the Transformer architecture, and its foundation is similar to the original GPT, with a few changes. Most notably, the largest version of GPT-2 contains 1.5 billion parameters and 48 layers. Furthermore, the vocabulary was expanded, the context size was increased, and a larger batch size was used. The model was trained on the WebText dataset, which the researchers created using web scraping techniques focused on retrieving only high-quality documents. The resulting dataset consisted of over 8 million documents [3]. GPT-2 achieved state-of-the-art performance on 7 out of 8 studied datasets in its zero-shot setting, meaning that it was not fine-tuned on any training data before evaluation.
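As a toy illustration of the objective in Equation 3.2, the snippet below sums the log-probabilities that a model assigns to the correct label over a small labeled set. The probability values are invented stand-ins, not real GPT outputs.

```python
# Toy illustration of the fine-tuning objective in Equation 3.2:
# the log-likelihood of the correct labels over a labeled dataset C.
import math

# Invented model outputs: P(y | x1, ..., xm) for the correct label y
# of each of three hypothetical (x, y) examples.
predicted_probs = [0.9, 0.8, 0.7]

# L2(C): the quantity maximized during discriminative fine-tuning.
log_likelihood = sum(math.log(p) for p in predicted_probs)
print(round(log_likelihood, 4))
```

Since each term is the log of a probability, the sum is at most zero, and pushing every correct-label probability toward 1 drives the objective toward its maximum.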
The most recent iteration of GPT, called GPT-3, was proposed by Brown et al. [1] in 2020. It continues the trend of expanding the number of parameters. In total, eight models were created, with the number of parameters ranging from 125 million to 175 billion [1]; the largest model is the one known as GPT-3. Compared to GPT-2, a few modifications were made, but it still follows the same architecture. It was trained on a larger dataset than GPT-2 and nearly matched the performance of fine-tuned models on benchmark datasets. The authors noted that this is a promising result, since GPT-3 only requires 10-100 examples in its few-shot setting to achieve good performance. This compares favorably to fine-tuned models, which can require labeled training datasets with hundreds of thousands of examples.

3.3.3.3 Using GPT-3

Unlike models such as BERT and those from scikit-learn, which can be downloaded to a computer and then trained, evaluated, and validated without an internet connection, GPT-3 cannot be used offline; its primary method of communication is an API. The API gives the user access to OpenAI's file uploading system, the use of their models, fine-tuning of models, and also the creation of embeddings. GPT-3 has a playground mode, which will largely not be used in this study since it is ineffective for classifying larger quantities of data. The playground mode presents a prompt to the user where it is possible to input tasks to GPT-3. GPT-3 can be asked to complete a sentence, classify an animal, or translate something from one language to another. An example of this can be seen in Figure 3.7. The "=>" sign seen in the figure is called the separator, and it tells GPT-3 where the task ends. This sign can be chosen arbitrarily as long as it is not present anywhere else in the prompt.
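The separator convention can be sketched as follows. The task text below is invented, and the commented-out request (using the 2022-era `openai` Python library) is shown for illustration only, since it requires an API key and a network connection.

```python
# Sketch of building a playground-style prompt with a separator.
task = "Is this line valid Python? x = [i for i in range(10)]"
separator = "=>"  # marks where the task ends; any token absent
                  # from the rest of the prompt would work

prompt = f"{task} {separator}"
print(prompt)

# Hypothetical completion request (not executed here):
# import openai
# response = openai.Completion.create(
#     engine="davinci", prompt=prompt, max_tokens=5
# )
```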
Still, it can be useful for demonstration purposes and for exploring how GPT-3 behaves when given certain tasks.

Figure 3.7: The GPT-3 playground where the user can input tasks. In this case GPT-3 was asked to validate if a given code line is valid in the programming language Python. The green text is GPT-3's response.

Furthermore, OpenAI provides an API for communicating with GPT-3. The API can be accessed through OpenAI's JavaScript library, Python library, and cURL commands. These methods make it possible to input classification tasks using code rather than the playground prompt. This study mainly uses the Python library rather than the playground prompt, as it enables multiple requests without pasting the tasks into the playground prompt.

3.3.3.4 Using GPT-3 as a classifier

GPT-3 can be used as a classifier for simpler tasks in its text completion mode. In Figure 3.8, GPT-3 is used to classify whether a language is an object-oriented language or a functional language. This classification works well, and if the use case were to use GPT-3 as a language classifier, it would suffice. However, when tasking GPT-3 with more complicated domains, such as information security policies, its text completion mode is insufficient. This is displayed in Figure 3.9, where it determines that Encryption is not part of ISO 27001:2013, even though it is. For this reason, this study uses GPT-3 in its classifier setting instead.

Figure 3.8: The playground prompt where GPT-3 is used as a classifier in its text completion mode. The green text is GPT-3's response.

Figure 3.9: The playground prompt where GPT-3 is used as a classifier in its text completion mode. The green text is GPT-3's response. Here GPT-3 fails to classify encryption as a part of the ISO 27001:2013 standard.

Example of using GPT-3 as a classifier

GPT-3, in its few-shot setting, works somewhat differently than in its text completion mode (see https://beta.openai.com/docs/guides/classifications). The whole procedure can be seen in Figure 3.10. The main difference to text completion is that examples of the data need to be uploaded to OpenAI; GPT-3 then uses the most relevant example data to classify the input. Using GPT-3 as a classifier can be divided into the following steps:

1. Format the data in JSONL format (https://jsonlines.org/).
2. Upload the data to OpenAI; the API responds with a file id corresponding to the uploaded data, which needs to be saved for later use.
3. Use either cURL commands, the Python library, or the JavaScript library to send a classification request to GPT-3.

Figure 3.10: The four step procedure taken by GPT-3 when used in its few-shot setting. Source of the image is https://beta.openai.com/docs/guides/classifications

To improve the classification shown in Figure 3.9, these steps can be applied as follows:

1. First, some example data needs to be provided, which is created manually for this example. The input to GPT-3 is of the format "Is [X] part of ISO 27001:2013?". The sample data can be seen in Table 3.1 and needs to be formatted into JSONL before being submitted to OpenAI.
2. Using Python with the openai library, the data is submitted to OpenAI:
   openai.File.create(file=open("example_data.jsonl"), purpose="classifications")
3. Then, the following code submits the query to GPT-3 for classification:
   model = openai.Classification.create(
       file=fileid,
       query="Is Cryptography part of ISO 27001:2013?",
       search_model="ada",
       model="curie",
       max_examples=5
   )
4. The response from GPT-3 can be seen in Figure 3.11. GPT-3 can now predict that Cryptography is part of ISO 27001:2013. The score indicates how useful GPT-3 deems an example to be. All examples seem to have a rather high score, indicating that GPT-3 is somewhat unsure of how to use these examples. However, this small example still gives insight into how GPT-3's classification works.
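The first of these steps can be sketched as follows: converting the Table 3.1 examples into JSONL, one JSON object per line. The "text"/"label" field names are assumed from OpenAI's classifications guide for the file format used in this study.

```python
# Sketch of formatting the Table 3.1 examples as JSONL for upload.
import json

examples = [
    ("Is User Access Management part of ISO 27001:2013?###", "Yes"),
    ("Is Information Classification part of ISO 27001:2013?###", "Yes"),
    ("Is geography part of ISO 27001:2013?###", "No"),
    ("Is handling of squirrels part of ISO 27001:2013?###", "No"),
]

# One JSON object per line; field names assumed per the guide.
lines = [json.dumps({"text": t, "label": l}) for t, l in examples]
jsonl = "\n".join(lines)
print(jsonl.splitlines()[0])

# The resulting string would then be written to example_data.jsonl
# and uploaded in the second step:
# with open("example_data.jsonl", "w") as f:
#     f.write(jsonl)
```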
Text | Label
Is User Access Management part of ISO 27001:2013?### | Yes
Is Information Classification part of ISO 27001:2013?### | Yes
Is geography part of ISO 27001:2013?### | No
Is handling of squirrels part of ISO 27001:2013?### | No

Table 3.1: The data used in the example.

3.3.3.5 BERT by Google

BERT, which stands for Bidirectional Encoder Representations from Transformers, is another pre-trained model similar to GPT-3 but with a different architecture. BERT was originally proposed by Devlin et al. [5]. The model is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. BERT is built on top of the Transformer, an attention-based model that learns contextual relations between words in a text.

Figure 3.11: Response from GPT-3 when sending the query "Is Cryptography part of ISO 27001:2013?", obtained using the OpenAI Python library. GPT-3's answer can be seen at line 3, where it labels the input as Yes. The selected_examples column shows how useful the uploaded examples were in classifying the input; a higher score means more useful.

The Transformer was initially designed for machine translation, but BERT is adapted for natural language understanding tasks such as sentence classification, question answering, and next-sentence prediction. BERT represents words in a text as vectors, or "embeddings". These embeddings are learned jointly with a multi-layer bidirectional Transformer encoder. The encoder reads the entire text input at once and learns to predict words that have been masked (replaced with [MASK]) or randomly replaced. The training process of BERT is self-supervised, meaning that it does not require labeled data. This makes it more efficient and scalable than previous models trained on labeled data. BERT is effective because it can capture the context of a word in a sentence, rather than just the word itself.
This is due to the bidirectional nature of the Transformer encoder.

3.4 Support Vector Machine (SVM)

A support vector machine (SVM) is a kernel-based linear classifier [57]. SVMs work in any number of dimensions [58], but for simplicity's sake, this section only considers the linear case. SVMs are based on the concept of dividing the data into classes using the best-fitted decision boundary [57], as can be observed in Figure 3.12. In higher dimensions, the decision boundary is referred to as a hyperplane; for example, if the data were in three dimensions, the decision boundary would be a plane. The decision boundary is given by Equation 3.3, where w is the weight vector, b is the bias, x is the data point, and +1 and -1 are used to label the two classes.

w^T x + b >= 0  ->  class +1
w^T x + b < 0   ->  class -1    (3.3)

However, this form of classification can lead to misclassification if points lie close to the boundary line, since small changes to x may flip the predicted class [57]. Therefore, to make the model more robust, a parameter ϵ can be added, which denotes how far the closest data point must be separated from the decision boundary. This is also referred to as a margin, and the SVM model attempts to fit a hyperplane that maximizes it. The reasoning is that the larger the margin, the larger the distance between the classes, and therefore the easier it is to differentiate between them. The decision boundary can then be written as Equation 3.4 and can also be observed in Figure 3.12.

w^T x + b >= ϵ/2   ->  class +1
w^T x + b < -ϵ/2   ->  class -1    (3.4)

Figure 3.12: The SVM classifier using a boundary line to separate and classify the data points.

4 Research Design

This study followed the research method Design Science Methodology (DSM) as described by Wieringa [59]. The purpose of DSM is to identify a problem, suggest an artifact that may improve the problem context, and validate whether the artifact operates as intended.
This process is defined as a design cycle and is performed iteratively. The design cycle is part of a larger problem-solving cycle defined as an engineering cycle. Wieringa defines the engineering cycle with the following five tasks:

1. Problem investigation
2. Treatment design
3. Treatment validation
4. Treatment implementation
5. Implementation evaluation

The design cycle consists of tasks one through three of the engineering cycle and may be performed numerous times before attempting a real-world application of the treatment, i.e., tasks four and five. For a visual representation of the engineering and design cycles, refer to Figure 4.1.

Figure 4.1: The engineering and design cycle as defined by Wieringa.

The methodology behind this study had a design cycle focus, and any real-world treatment implementation and evaluation are left as possibilities for future work. Hence, this section also follows the design cycle structure by outlining the problem investigation and the methodology used to design and validate an artifact. Although the design cycle is an iterative process, this section discusses the tasks in a linear fashion.

In the problem investigation section, the problem context is introduced as provided by the case company. The problem context is then further investigated for the purpose of identifying and defining the problem. After the investigation, an artifact that interacts with the defined problem is suggested in the treatment design stage. Finally, in treatment validation, the validation methods used for the suggested artifact are presented. Furthermore, a discussion of how the research questions are intended to be answered is provided in this section.

4.1 Problem Investigation

The case company, which specializes in software solutions and policy compliance, defined their problem context as information security policy content review in relation to ISO and expressed a desire for improvement.
Hence, the problem context was investigated together with the experts at the case company by conducting continuous knowledge transfer sessions in the form of meetings, where frequent discussions regarding their difficulties and work processes were held. These sessions were performed in order to add more background to the problem context and the current solution. For more details on these sessions, refer to Section 4.4.

Among the identified difficulties was the time-consuming, labor-intensive, and error-prone process of manually checking the completeness of each ISO control (as described in Section 1.1). Furthermore, the accepted level of information security defined by ISO is rather vague and may be formulated differently depending on the characteristics of an organization, despite using the same standard. Automating the completeness checking process by building machine learning models trained on these different variations can save a lot of time for the expert performing the information security policy review in relation to ISO. The saved time can instead be allocated to controls that either fail the completeness check or are more difficult to classify.

Hence, the problem investigated in this study was to determine the completeness of information security policies in relation to the ISO 27001:2013 standard utilizing different classical and modern NLP and machine learning methods. The main stakeholders affected by the study were any organization concerned with information security, policy experts, and researchers from, but not limited to, the disciplines of software engineering and applied artificial intelligence. The project has the possibility of providing organizations with an affordable method to achieve an acceptable level of information security and/or detect shortcomings in an information security policy in relation to ISO.
To establish a solution for this problem context, information security policies needed to be gathered in order to evaluate a machine learning model and framework that ultimately became the design artifact of this study.

4.2 Treatment Design

This section provides details on the approach used by the study for designing an artifact. More specifically, descriptions of the understanding and preparation of data, the creation of baseline and benchmark models, and the GPT-3 models are provided here. Furthermore, a discussion of what a good-quality machine learning framework could be, i.e., RQ1, is also provided.

4.2.1 Data understanding

The ISO 27001:2013 standard consists of 114 controls belonging to 35 control objectives, which in turn belong to 14 domains. In other words, a domain is at the highest level, a control objective at the middle level, and a control at the lowest level. Hence, the possible levels of data granularity to use for model learning are these three, and a large portion of all the domains should be present in any ISO certified information security policy, as established in Section 3.2.

However, the structure of information security policies, regardless of ISO certification, can look quite different from one policy to another. The differences are not only due to the lack of a normalized structure, but also to organizations' unwillingness to share their policy at the lowest level, as information security research has been described as "one of the most intrusive types of organization research" [60]. Hence, the lowest level of an organization's information security policy may often be hidden, its higher levels may differ too much from other policies, and/or the information may be too spread out across the policy to define where it is covered. This ultimately led to a major realization and decision for the course of this study.
The realization is that it may be impossible to measure the completeness of an information security policy in relation to all 14 domains, 35 control objectives, or 114 controls, mainly due to the reasons mentioned above, but also due to time limitations. Therefore, the focus of this study shifted to establishing a proof-of-concept of measuring completeness in relation to a few controls, control objectives, or domains, rather than the entirety of the ISO standard. The collection of controls was carefully selected together with an expert who had insight into what content was most often present in any information security policy without having access to the lowest levels.

4.2.2 Data collection and overview

The data collection was done in iterations. It started out with finding information security policies that were publicly available on organization websites and later also moved to contacting organizations through customer support or directly through phone or email. When searching for policies on the internet, search strings such as information security, information security policy, isp, public information security policy, and security policy were used; see Appendix A.1 for the full list. When contacting customer support through email, a pre-written email was sent, which disclosed who the authors were, what the purpose of the data collection was, and that anonymity would be guaranteed if they decided to share their policy; see Appendix A.2 for the template. When choosing companies to contact, a randomly selected subset of 30 was used from the list of the 100 largest companies in Sweden by turnover1. The full list of contacted companies can be observed in Figure 4.2.

Figure 4.2: The full list of contacted companies. In the Name column, the contacted company's name is listed. In the Response column, the results from the exchanges are listed.
The cells were marked green if a response was received in which the company disclosed that they would, or would not, share their policy, or if they redirected us. Information security policies were received from two companies; their names were therefore redacted to [Company].

The industries and geographical locations of the organizations were diversified, with some having more exposure than others. For example, the leading industry in the gathered dataset was academics, and the leading country was the UK. The structure of the documents varied greatly from policy to policy. The assumptions made in the previous section were quickly found to be true, as the policy content could, for example, be in bullet points, clearly marked sections, or plain text. A few policies also referred to internal documents to which access was only allowed for authorized personnel.

The biggest difficulty faced with data gathering was the sparse publicly available data that could be collected, but also confirming ISO certifications. The latter was done through third-party websites that keep records of ISO 27001:2013 certified companies and through extensive searching on company websites. In total, the dataset that was later annotated contained 49 information security policies. The distribution of the dataset with respect to labels was fairly even and can be observed in Figure 4.3.

Figure 4.3: A bar plot of the dataset containing ISO and non-ISO data points.

1https://www.largestcompanies.com/toplists/sweden/largest-companies-by-turnover

4.2.2.1 Data annotation

After a data gathering iteration had been performed, the data was extracted and annotated by the authors according to a set of controls, and was also labeled according to whether it was ISO certified or not. The set of controls was selected together with a policy expert such that the annotations could be carried out by non-experts, since annotation is a time-consuming task.
Hence, the data was annotated by two non-experts and was annotated twice per selected set of controls in order to avoid conflicting perspectives. Observe Figure 4.4 for the template data sheet that was used for data annotations in this study.

Figure 4.4: The template used for data annotations, including an example for the sake of illustration.

In Figure 4.4, the first column corresponds to the file name in the dataset and exists to avoid confusion with un-annotated files. The first column was also immediately dropped before any data processing began. The second column corresponds to the ISO label of the information security policy. The third column represents whether the control was present in the policy or not. If it was not found and the policy was labeled as non-ISO, a random number generator was used to select random text from somewhere in the policy. The reasoning behind this was to fill out the dataset with data that the model could still learn from. Conversely, if the control was not found and the policy was labeled as ISO, that extract was left empty to avoid biasing the model. The fourth and fifth columns were used as a method for reducing bias in the annotation process. The first annotator extracted the text and labeled the data point, then put a "1" in the Annotation_1 column to indicate that the data point had been annotated once. The second annotator then reviewed this annotation; if the second annotator agreed, the data point was considered complete. However, if the annotators disagreed, the data point was discussed until a consensus was reached, or it was brought up with a policy expert from the case company (see Section 4.4). Finally, the sixth column is the extract from the policy that was later used by the models.

4.2.3 Modeling

In order to answer RQ1, a machine learning framework needed to be established, and the decision was to be made based on the factors of data and manual labor needed at each data granularity level.
The data granularity levels are defined by the level of detail in the texts [61]. For example, a domain-level statement is very high-level since it contains the least amount of detail. The amount of data needed to train a successful model is determined by an estimate of the number of controls covered and their inherent complexity. To illustrate, a domain, which contains the most controls, each with their own complexity, is estimated to need more data than a single control. In other words, the less detailed the chosen level of data granularity, the more data is needed to cover the additional controls included. This argument is founded on the fact that the more controls are included, the more variance and variables have to be taken into account, and thus the higher the data demand. On the other hand, the more detailed the chosen level of data granularity, the harder it is to identify and annotate without the help of experts.

After many discussions, a consensus was reached together with the experts that the finest level of data granularity, i.e., the control level, was the preferred level of granularity. This choice was made on the basis that a control-level model brought the most benefit for the experts, as that was the level at which completeness checking was most often performed. Additionally, this put less pressure on the already scarce dataset. To avoid difficulties with annotations, the controls were selected such that non-experts could also identify them with a few guidelines from the experts.

The final architecture of the framework is presented in Figure 4.5 and remained to be evaluated to determine whether the estimated factors were sensible and whether it performed better than baseline models.

Figure 4.5: Framework architecture for information security policy control classification in relation to ISO.

In Figure 4.5, the input to the framework is defined in the box to the left.
It consists of a user-supplied selection of a control and an extract corresponding to that control from the information security policy in question. The box in the top right represents the machine learning pipeline and consists of an NLP pipeline combined with a binary machine learning classifier. Finally, the output in the box to the bottom right is the result from the classifier: a one if the control has ISO-level content and a zero if it does not.

4.2.3.1 Control selection

The controls were selected together with a policy expert under the additional assumption that each control was present in most public information security policies. Furthermore, the selected controls also varied in the factors that needed to be completeness checked against. In other words, this meant that the length, wordiness, and ways of formulating the extracts varied between the controls. The purpose of this was to measure the model's success even with controls that are more difficult to check completeness against. The selected controls are listed in Table 4.1.

4.2.3.2 Data preparation and models

The data was prepared before being processed by the machine learning models by first passing it through an NLP pipeline. While there were many different algorithms and packages to utilize for each task in the pipeline, the overall process is as depicted in Figure 4.6. In Figure 4.6, documents (or control extracts from information security policies) are input into a machine learning pipeline. The pipeline itself consists of two sections: the first is the NLP pipeline, and the second combines it with a machine learning classifier. In combination, an output of a classification label should be
Control id | Control name | Control definition
A.5.1.2 | Review of the policies for information security | The policies for information security shall be reviewed at planned intervals or if significant changes occur to ensure their continuing suitability, adequacy, and effectiveness.
A.7.2.2 | Information security awareness, education and training | All employees of the organization and, where relevant, contractors shall receive appropriate awareness education and training and regular updates in organizational policies and procedures, as relevant for their job function.
A.9.2.3 | Management of privileged access rights | The allocation and use of privileged access rights shall be restricted and controlled.

Table 4.1: Table of selected controls. The table is an extract from Table A.1 in Annex A of SS-EN ISO/IEC 27001:2017 and is reproduced with due permission from SIS, the Swedish Institute for Standards, which holds the copyright and also sells the complete standard (www.sis.se).

expected. The NLP pipeline comprises three tasks: text parsing, text normalization, and text vectorization. Meanwhile, the machine learning classifier consists of a single binary classifier. For a more detailed view of each task within the machine learning pipeline, observe Figure 4.7.

In Figure 4.7, text parsing is defined as tokenizing the text and acts as input to the text normalization task, which has three sub-tasks: case-folding, stop-word removal, and lemmatization. For case-folding, the Python standard library string operations were used. For stop-word removal and lemmatization, a pre-defined list of stop-words and a lemmatizer known as the WordNetLemmatizer were used, both of which came from the NLTK library. The following task, text vectorization, consisted of the three different types of word embedding models defined in Section 3.3.2. The TF-IDF vectorizer from the scikit-learn library was selected to represent the frequency-based model.
For the sequence-based models, Word2Vec and GloVe models from the Gensim library were selected. Finally, GPT-3 from OpenAI and BERT from the TensorFlow library were selected for the contextual models. The output from the NLP pipeline was then fed as input into the binary machine learning classifier, which was chosen to be the LinearSVC model from the scikit-learn library. The LinearSVC model is essentially an SVM that uses a linear separator, as described in Section 3.4. GPT-3 and BERT, on the other hand, did not need a stand-alone binary classifier, as these models handle the classification internally.

Figure 4.6: The overall machine learning pipeline. Documents enter the machine learning pipeline, which consists of an NLP pipeline and a machine learning classifier, and an output is produced.

Figure 4.7: Detailed view of the machine learning pipeline in Figure 4.6, including sub-tasks and the models used.

The main reasoning behind the selected combination of sub-tasks used for text normalization was to mimic the preprocessing of other pre-trained word embedding models [43] and maximize the gain from their usage by aligning the tokenized words. These pre-trained word embedding models were mainly related to the sequential embeddings of Word2Vec and GloVe, of which Gensim offered a wide variety. For this study, the word2vec-google-news-300 and glove-wiki-gigaword-300 pre-trained embeddings were selected [50]. The word2vec-google-news-300 was a collection of 300-dimensional pre-trained word vectors and used Word2Vec as a basis for its learning technique. The model had been extensively trained on text corpora from Google News and had a total of three million vectors. Similarly, the glove-wiki-gigaword-300 was also a 300-dimensional pre-trained collection of word vectors but used GloVe as a basis for the learning technique.
The model had been extensively trained on text corpora from Wikipedia, an online encyclopedia, and English Gigaword, an archive of newswire text data. Finally, glove-wiki-gigaword-300 had a total of 400 000 vectors [50]. These two pre-trained models are henceforth referred to as pre-trained Word2Vec and pre-trained GloVe.

4.2.4 Using GPT-3

In this section, the methodology for using GPT-3 is presented. Firstly, the method for the few-shot mode is shown, which includes how the data was uploaded to the model and the different parameters. Secondly, the method for the fine-tuned model is presented.

4.2.4.1 Uploading data

For both the fine-tuning mode of GPT-3 and the few-shot mode, the data that the models used needed to be uploaded to OpenAI. At the time, OpenAI only supported JSONL2 files, which are similar to regular JSON files except that each line in the file is a valid JSON value. These were structured in the following way:

{"text":"Text data of ISO certified policy + \n\n===\n\n", "label":"iso"}
{"text":"Text data of a policy that is not certified + \n\n===\n\n", "label":"oth"}

where \n\n===\n\n was used as the separator.

4.2.4.2 Few-shot mode

The data was uploaded to OpenAI for classification. A call to time.sleep(20) was used since it took some time for OpenAI to upload and process the text.

def _upload_training_file(self, path):
    response = openai.File.create(file=open(path), purpose="classifications")
    time.sleep(20)
    self.training_file_id = response.id

After the data was uploaded, a model was created; the hyperparameters for the model can be seen in Table 4.2. The hyperparameter File id corresponded to the uploaded data and changed when using K-fold cross-validation, since one id was needed for each fold. Prompt was the input data to be classified.
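As a minimal sketch of the JSONL structure described in Section 4.2.4.1 (the policy texts are placeholders, not data from the study), each record can be built as follows:

```python
import json

SEPARATOR = "\n\n===\n\n"  # separator appended to every text, as described above

def make_record(policy_text, iso_certified):
    """Build one JSONL record with the 'iso'/'oth' labeling scheme."""
    return {
        "text": policy_text + SEPARATOR,
        "label": "iso" if iso_certified else "oth",
    }

records = [
    make_record("Text data of ISO certified policy", True),
    make_record("Text data of a policy that is not certified", False),
]

# One valid JSON value per line, ready for upload via openai.File.create().
jsonl = "\n".join(json.dumps(r) for r in records)
```

Because json.dumps escapes embedded newlines, each record stays on a single line, keeping the file valid JSONL.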
Setting | Value
Model name | curie
Prompt | The text to be classified
Search model | ada
Logprobs | 2
Max examples | 3
Expand | Completion
Training file id | The id of the uploaded files

Table 4.2: The hyperparameters of GPT-3, in its few-shot setting, when classifying information security policies. In this table, logprobs = 2 indicates that the model returns the logarithmic probability values.

2https://jsonlines.org/

4.2.4.3 Fine-tuning mode

GPT-3's fine-tuning mode worked similarly to the few-shot mode. The main difference was that the fine-tuning mode was not yet supported by the OpenAI Python library, and therefore the OpenAI CLI and cURL commands needed to be used instead. The process can be described as follows.

1. Prepare the data: The OpenAI CLI provided commands that prepared the training data. The command verified that the training data had a separator at the end of each prompt, that the label started with a whitespace, and that the JSONL file was correctly formatted.

2. Create the fine-tuned model: Using the OpenAI CLI, a model was created by executing the command

openai api fine_tunes.create -t <TRAIN_FILE> -m <BASE_MODEL>

OpenAI responded with a model name, which was saved so that the model could be reused unless it needed to be re-trained.

3. Use the fine-tuned model: The fine-tuned model was used through the following OpenAI CLI command

openai api completions.create -m <FINE_TUNED_MODEL> -p <PROMPT>

where FINE_TUNED_MODEL was the model name retrieved from the previous step, and PROMPT was a selected text from the testing dataset. To evaluate the model, this command needed to be executed for each value in the testing dataset. The returned value was then compared to the true value to compute the F1-score.

4.3 Treatment Validation

In this section, the metrics that were used to evaluate the models are presented, and the methods of validation are described.
The model validation in this study was performed in three steps in order to answer RQ2 and a few of its facets. This section follows the same structure as those steps, except that the metrics and tools used for comparison are explained first.

The first step in the model validation process was to compare models and word embedding techniques in order to investigate which model features were best suited for each dataset. To perform this step, a collection of models was trained on a training set and then evaluated on a testing set. The results were then compared between the models in terms of F1-score and accuracy, and analyzed in order to provide an answer for RQ2.1 and partially for RQ2.2.

The second step was to validate the performance of the framework in combination with GPT-3 by comparing its results to a validation dataset that had been annotated by a policy expert. This step was mainly performed to evaluate whether the framework and model functioned as intended, i.e., whether it could aid or replace an expert in determining completeness. The result from this step complemented the previous results, and together they provided a complete answer for RQ2.2.

The third, and final, step was to investigate the use of GPT-3 in other domains, such as biomedicine, and compare it to the performance achieved in this study in order to determine whether GPT-3 was a good fit for information security classification. The results from the other domains were obtained from various research papers that have used GPT-3 for text classification. The result from this step provided an answer to RQ2.3 and thus also allowed for a discussion of RQ2.

4.3.1 Metrics and tools

The metrics and tools used to evaluate and validate the various models are defined in this section. Besides checking the accuracy of a model on a testing set, the metrics of precision, recall, and F1-score were used along with the method of K-fold cross-validation.
4.3.1.1 Precision, Recall and F1-score

When dealing with imbalanced datasets, evaluation metrics such as precision and recall are preferred [62]. The reason is that accuracy alone can be misleading: a model that labels all texts the same can still achieve high accuracy on an imbalanced dataset. Therefore, the evaluation metrics chosen in this paper were precision, recall, and F1-score. These are based on the confusion matrix seen in Table 4.3 and are defined in Equations 4.1, 4.2, and 4.3.

 | Predicted Positives | Predicted Negatives
Real Positives | TP | FN
Real Negatives | FP | TN

Table 4.3: Confusion matrix, where TP = True Positives, FN = False Negatives, FP = False Positives, TN = True Negatives.

precision = TP / (TP + FP)    (4.1)

recall = TP / (TP + FN)    (4.2)

F1-score = 2 · precision · recall / (precision + recall)    (4.3)

4.3.1.2 K-fold cross-validation

K-fold cross-validation is a method for validating a model by partitioning the dataset into k subsets, training the model on k − 1 subsets, and testing the model on the remaining subset [63]. This is repeated until every subset has been used as the testing set. K-fold cross-validation was used across all datasets, with 75% of the total dataset partitioned as a training set and the remaining 25% as a testing set. The final result was calculated as the mean of the results across the k subsets.

4.3.2 Model comparisons

The model comparisons were performed by comparing GPT-3 to various benchmark models in order to validate its performance and answer the question of which characteristics were desired for the classification of the datasets. These benchmark models are defined in Table 4.4.
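To make Equations 4.1 to 4.3 and the fold partitioning concrete, the following is a small plain-Python sketch with synthetic labels (illustrative only, not data from the study):

```python
def confusion(y_true, y_pred):
    """Counts from the confusion matrix in Table 4.3."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def f1_score(y_true, y_pred):
    """F1-score as in Equation 4.3, via precision (4.1) and recall (4.2)."""
    tp, fp, fn, _ = confusion(y_true, y_pred)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def kfold_splits(n, k):
    """Partition indices 0..n-1 into k disjoint test folds (Section 4.3.1.2).

    With k = 4, each test fold holds 25% of the data and the remaining
    75% is used for training, matching the split used in this study.
    """
    return [list(range(i, n, k)) for i in range(k)]

# Synthetic example: TP = 3, FP = 1, FN = 1, so precision = recall = 0.75.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
score = f1_score(y_true, y_pred)  # -> 0.75
```

In practice the study computed these metrics per fold and averaged them, as described in Section 4.3.1.2.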
Model name | Word embedding type | Word embedding model | Classifier
ZeroR | - | - | Zero Rule Classifier
TF-IDF | Frequency | TfidfVectorizer (scikit-learn) | LinearSVC
Word2Vec | Sequential | Word2Vec (Gensim) | LinearSVC
Word2Vec (pre-trained) | Sequential | word2vec-google-news-300 (Gensim) | LinearSVC
GloVe (pre-trained) | Sequential | glove-wiki-gigaword-300 (Gensim) | LinearSVC
BERT | Contextual | bert_en_uncased (TensorFlow) | BERT

Table 4.4: Table of benchmark models with their word embedding types and models, as well as the classifier they are combined with.

In Table 4.4, the model names are given along with their word embedding type, word embedding model, and the classifier they are coupled with. The same model names are referenced in the upcoming sections. An additional model that has previously been left unnamed is also present in the table: the ZeroR model, a Zero Rule Classifier. A ZeroR classifier is a model that assigns all data points to the most frequent class and is a common baseline to validate performance against. It is also a method for determining whether the model in question is a useful predictor. To illustrate, a ZeroR model always predicts the majority class of any given dataset, ignoring all possible predictors [64]. In the case of this study, the classifier predicted all data points as "Non-ISO", as the dataset leaned slightly more towards Non-ISO labeled data points than ISO.

The validation of GPT-3, on the other hand, consisted of evaluating the two models, few-shot and fine-tuned, on the annotated dataset. For both models, there was one common parameter, the Model parameter, whose values were Davinci, Curie, Babbage, and Ada. The differences between these models lay in which tasks they performed well on, how costly they were to use, and how efficient they were when it came to training.
Overall, Davinci was the best performing model, but it was more costly and slower to train. Ada, on the other hand, was the simplest model and therefore cheaper to use and faster to train than Davinci. In this study, for GPT-3 few-shot, Davinci and Curie were used as classifier models, and Ada and Curie were used as search models, while for GPT-3 fine-tuned, Davinci and Ada were used as classifier models.

For the few-shot model, three parameters were changed during the validation process: max_examples, which determined how many examples GPT-3 was given at inference time; K, which determined how many subsets the dataset was divided into (see Section 4.3.1.2); and lastly the model name. For the fine-tuned model, only the model was changed.

4.3.3 Expert validation with the case company

The expert validation was carried out with policy experts from the case company. This was done to determine how GPT-3's predictions compared to those of experts in information security policies. The process consisted of creating a new dataset for each control, referred to as a validation dataset. Each of these datasets consisted of ten extracts from information security policies corresponding to the selected control, which no model had previously seen or been evaluated on. The expert validation session was carried out in the form of a demo session where an expert was quickly briefed on the con