An NLP approach to assess information security policies
Application of GPT-3 within a policy domain

Master's thesis in Computer science and engineering

Hampus Lundblad
Pouya Faramarzi

Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
Gothenburg, Sweden 2022

© Hampus Lundblad 2022.
© Pouya Faramarzi 2022.

Supervisor: Miroslaw Staron, Interaction Design and Software Engineering
Advisor: Daniel Dalevi, Centiro
Examiner: Lucas Gren, Interaction Design and Software Engineering

Master's Thesis 2022
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Typeset in LaTeX

Abstract

Threats to companies' information security are ever-increasing, and to adequately protect a company's information assets, a proper information security policy needs to be established. For this purpose, information security standards such as ISO 27001:2013, created by the International Organization for Standardization, exist. However, for a policy to be complete towards ISO 27001:2013, the policy must fulfill up to 114 different requirements, also called controls.
Experts within information security policies often do this work, which can be time-consuming and error-prone. This study therefore aimed to use natural language machine learning models to classify whether a text extract from a given information security policy is complete towards a specified control or not. Ultimately, the study investigates whether language models are a good fit for software engineering topics that are also business-critical.

The study utilized the design science methodology. A framework for determining policy completeness was constructed, and different natural language machine learning classifiers were evaluated. The main focus was on the large-scale pre-trained model GPT-3 by OpenAI. Three different datasets were constructed to train the models, each consisting of annotated text extracts from information security policies. These were labeled as either being ISO certified or not, depending on whether the company, or the policy itself, mentioned an ISO certification. The models were then evaluated on these three datasets, with F1-score and accuracy as evaluation metrics. Lastly, a validation session with a policy expert from a case company that specializes in software solutions and policy compliance was conducted to determine how GPT-3's evaluation of policies compares to that of an expert.

The results showed that GPT-3, and the pre-trained word embedding model GloVe with SVC as a classifier, performed better in policy classification than the other machine learning models. However, when compared to an expert, GPT-3 fails to distinguish between policies that are not complete towards ISO and policies that are partially complete towards ISO, something which the policy expert was able to do. We conclude that GPT-3 has the potential to perform well in the domain of information security policy.
However, due to a lack of data and expertise in the domain of information security policies, the results from the validation session do not reflect this. Hence, the authors provide a discussion regarding this and recommendations for future work.

Keywords: software engineering, information security policy, ISO, NLP, OpenAI, GPT-3, machine learning

Acknowledgements

We would like to express our gratitude to our supervisor from Chalmers, Miroslaw Staron, who provided valuable and relevant feedback, helped us with the direction of the thesis, and provided support throughout the entire project. We would also like to thank Centiro for offering us the opportunity to pursue our thesis together with them. A special thank you to Daniel Dalevi, Mikael Böörs, Gustaf Stawåsen, and Thomas Herkel from Centiro for their expertise and support. Their feedback and perspective have helped us immensely. Lastly, we would like to thank our examiner Lucas Gren, who read and gave us feedback on our thesis and how to complete it.

Hampus Lundblad, Gothenburg, June 2022
Pouya Faramarzi, Gothenburg, June 2022

Contents

List of Figures
List of Tables
1 Introduction
  1.1 Practical scenario
  1.2 Research questions
  1.3 Delimitation
  1.4 Report structure
2 Related Work
  2.1 Text classification in Natural Language Processing
  2.2 Policy classification
  2.3 GPT-3 & BERT
3 Background
  3.1 Domain background
  3.2 ISO 27001:2013
  3.3 Natural language processing
    3.3.1 Text normalization
    3.3.2 Text vectorization
      3.3.2.1 Frequency based representations
      3.3.2.2 Sequence based representations
      3.3.2.3 Contextual based representations
    3.3.3 Pre-trained models
      3.3.3.1 Transformers
      3.3.3.2 GPT-3 by OpenAI
      3.3.3.3 Using GPT-3
      3.3.3.4 Using GPT-3 as a classifier
      3.3.3.5 BERT by Google
  3.4 Support Vector Machine (SVM)
4 Research Design
  4.1 Problem Investigation
  4.2 Treatment Design
    4.2.1 Data understanding
    4.2.2 Data collection and overview
      4.2.2.1 Data annotation
    4.2.3 Modeling
      4.2.3.1 Control selection
      4.2.3.2 Data preparation and models
    4.2.4 Using GPT-3
      4.2.4.1 Uploading data
      4.2.4.2 Few-shot mode
      4.2.4.3 Fine-tuning mode
  4.3 Treatment Validation
    4.3.1 Metrics and tools
      4.3.1.1 Precision, Recall and F1-score
      4.3.1.2 K-fold cross-validation
    4.3.2 Model comparisons
    4.3.3 Expert validation with the case company
    4.3.4 Domain result comparison
  4.4 Weekly meetings with case company
5 Results
  5.1 Model comparisons
    5.1.1 GPT-3
      5.1.1.1 Few-shot
      5.1.1.2 Fine-tuning mode
    5.1.2 Comparison with benchmark models
  5.2 Expert validation with the case company
  5.3 Domain result comparison
6 Discussion
  6.1 Framework evaluation
  6.2 GPT-3 and information security policy completeness
    6.2.1 GPT-3 compared to benchmark models
    6.2.2 GPT-3 compared to expert results
  6.3 GPT-3 compared to other domains
  6.4 Threats to validity
    6.4.1 External Validity
    6.4.2 Internal Validity
  6.5 Future work
7 Conclusion
Bibliography
A Appendix 1
  A.1 Search words
  A.2 E-mail template

List of Figures

1.1 Activity diagram of a practical scenario.
1.2 Activity diagrams of a control mapping (left) and a completeness check (right). As an example, the control of "Policy on the use of cryptographic controls" from ISO 27001:2013 was used and its implementation guidance provided in ISO 27002:2013 was referenced when determining examples of characteristics for completeness.
3.1 ISO 27001:2013 extract of the domain Human resource security. The table is an extract from table A.1 in Annex A of SS-EN ISO/IEC 27001:2017 and is reproduced with due permission from SIS, the Swedish Institute for Standards, who holds the copyright and also sells the complete standard www.sis.se.
3.2 ISO 27002:2013 extract of the implementation guidance for the control A.7.3.1 defined in ISO 27001. The text is taken from SS-EN ISO/IEC 27002:2017 and is reproduced with due permission from SIS, the Swedish Institute for Standards, who holds the copyright and also sells the complete standard www.sis.se.
3.3 Text normalization example of processing a sentence with each of the following NLP tasks: tokenization, case-folding, stop-word removal, and lemmatization.
3.4 Text vectorization example using a frequency based approach.
3.5 A graph displaying different NLP models along with the number of parameters for each model [1], [2], [3], [4], [5].
3.6 A simplified overview of the transformer's architecture.
3.7 The GPT-3 playground where the user can input tasks. In this case GPT-3 was asked to validate if a given code line is valid in the programming language Python. The green text is GPT-3's response.
3.8 The playground prompt where GPT-3 is used as a classifier in its text completion mode. The green text is GPT-3's response.
3.9 The playground prompt where GPT-3 is used as a classifier in its text completion mode. The green text is GPT-3's response. Here GPT-3 fails to classify encryption as a part of the ISO 27001:2013 standard.
3.10 The four step procedure taken by GPT-3 when used in its few-shot setting. Source of the image is https://beta.openai.com/docs/guides/classifications
3.11 Response from GPT-3 when sending the query "Is Cryptography part of ISO 27001:2013?", done using the Python library OpenAI. GPT-3's answer can be seen at line 3, where it labels the input as Yes. The selected_examples column returns how useful the uploaded examples were in classifying the input; a higher score means more useful.
3.12 The SVM classifier using a boundary line to separate and classify the datapoints.
4.1 The engineering and design cycle as defined by Wieringa.
4.2 The full list of contacted companies. In the Name column, the contacted company's name is listed. In the Response column, the results from the exchanges are listed. The cells were marked green if a response was received where they disclosed that they would, or would not, share their policy, or if they would redirect us. From two companies information security policies were received; therefore their names were redacted to [Company].
4.3 A bar-plot of the dataset containing ISO and non-ISO data points.
4.4 The template used for data annotations including an example for the sake of illustration.
4.5 Framework architecture for information security policy control classification in relation to ISO.
4.6 The overall machine learning pipeline.
From documents, into the machine learning pipeline, which consists of an NLP pipeline and a machine learning classifier, and yields an output.
4.7 Detailed view of the machine learning pipeline in Figure 4.6 including sub-tasks and models used.
4.8 The template used for the expert validation session with an example for the sake of illustration.
5.1 The results of using the models on the A512 dataset. GPT-3 in its few-shot setting, with Ada as search model and Curie as the classification model, is the best performing, achieving an accuracy of 0.7 and an F1-score of 0.727.
5.2 Results from all models evaluated on the dataset A722. The purple dotted line shows the ZeroR baseline.
5.3 Results from all models evaluated on the dataset A923. For the model Word2Vec an F1-score was unable to be calculated. The purple dotted line shows the ZeroR baseline.
5.4 Scatter plots for each dataset, where the yellow dots are the expert's answers to how complete a policy text extract was towards a given ISO control. The red and blue dots represent GPT-3's probability towards the text extract being related to an ISO certified policy. The X-axis represents which text was used. The Y-axis represents the answers towards how complete the text was towards ISO 27001:2013.
5.5 Residual plots for each dataset, where the error is calculated by using the expert's answers as the true value. If the dots are closer to the 0.0 line, then they are more aligned with the expert's opinion. The Y-axis shows the error, and the X-axis shows for which text the error was calculated.

List of Tables

3.1 The data used in the example.
4.1 Table of selected controls. The table is an extract from table A.1 in Annex A of SS-EN ISO/IEC 27001:2017 and is reproduced with due permission from SIS, the Swedish Institute for Standards, who holds the copyright and also sells the complete standard www.sis.se.
4.2 The hyperparameters of GPT-3, in its few-shot setting, when classifying information security policies. In this table, logprobs = 2 indicates that the model returned the logarithmic probability values.
4.3 Confusion matrix, where TP = True Positives, FN = False Negatives, FP = False Positives, TN = True Negatives.
4.4 Table of benchmark models with their word embedding types and models as well as the combined classifier.
5.1 Results of applying GPT-3 few-shot to the three different datasets. Bold text is used to show the highest value for each category. For values of K > 1 the average score was calculated across the runs.
5.2 F1-score and accuracy score for the fine-tuned GPT-3 model. The scores in bold are the best performing for that model and dataset.
5.3 Features found with an increasing significance level for each control.
5.4 Expert ranked features based on most characterizing for each control using a Likert scale.
5.5 The scores from GPT-3 applied on different datasets. BioText, MedSTS and PubMedCRT are from the study by Moradi et al. [6]. ADE and NIS are from the study by Alex et al. [7]. The datasets A512, A722, and A923 are taken from the results of this study, where the highest F1-score was chosen for each dataset.

1 Introduction

Compliance with standards from the International Organization for Standardization (ISO) regarding information security policies requires an organization to have an information security policy.
Still, it is not easy to create a good one. An information security policy alone is far from sufficient to provide adequate safety measures for an organization, yet, owing to the increased significance of information technology, it has received considerable attention. Additionally, protecting and securing organizations' information assets has become an increasingly challenging task as the complexity of security threats has grown [8]. The fact that large and high-profile companies such as Twitch (a game-streaming platform) have been subjected to a data leak [9], and Kaseya (an IT company that provides software to, among others, COOP) has been subjected to a ransomware attack [10], within the last twelve months signals that no organization is truly safe. Thus, on behalf of the stakeholders, there is now pressure and demand that organizations accept their responsibility in terms of offering adequate information security measures [11].

Well-known standards, such as COBIT or the ones defined by ISO [11], provide guidelines and frameworks with various objectives that are used towards implementing a robust information security policy that reflects the needs and risks of an organization [12]. However, it has been suggested that the guidelines provided by the standards are too generic, and organizations find it challenging to assess the completeness of their information security policy with respect to them, which ultimately presents a key obstacle towards achieving certification [13] [14]. As a result, organizations find themselves in a time-consuming and expensive certification process [15].

A considerable amount of research has been conducted into assessing the completeness of privacy policies, mainly by utilizing machine learning methods [16] [17]. However, little research has been done within the domain of information security policies.
Therefore, it is not yet clear whether machine learning methods can be adapted to software engineering domains that are also business-critical, in particular, to assess the completeness of information security policies. Hence, additional studies on using machine learning methods within the domain of information security and its policies are needed.

This study aims to alleviate the aforementioned completeness issue by investigating to what extent Natural Language Processing (NLP) models can help with determining information security policy completeness in relation to the ISO 27001:2013 standard, with the intent to generalize the findings by offering a framework for completeness checking. Because publicly available information security policies are scarce, pre-trained language models, such as GPT-3, are used to maximize model learning rates with a small sample size. Additionally, this thesis aims to provide a framework that can be used without expert analysis to adjust an organization's information security policy and fill the gap between the guidelines and the individual characteristics of an organization by using accessible and inexpensive methods.

Furthermore, there also exists a gap between defining a policy and applying it in practice. For example, it is possible for any organization to have an adequate information security policy in relation to ISO, but difficult to establish, by reviewing the policy alone, whether the organization actually implements the policy in practice. Hence, the study deals with "completeness", i.e., the presence of critical elements, rather than "compliance", in relation to ISO.

In order to investigate the practical feasibility of the study as a real-world application and gain access to the domain knowledge of experts, a collaboration was established with Centiro.
Centiro, henceforth referred to as this thesis's case company, is an organization specializing in software solutions and policy compliance, located in Borås, Sweden.

1.1 Practical scenario

The starting point for any organization to achieve ISO certification is to first define an ISO 27001 scope statement. The scope statement sets the boundaries on what processes, products, or departments an information security management system should cover within the organization. More importantly, the scope also allows for choosing which aspects of the ISO 27001 standard need to be implemented in order to be granted an ISO certification. These aspects are more commonly referred to as controls, and ISO 27001:2013 is made up of 114 of them. Hence, an organization only needs to comply with the controls that are deemed necessary based on its set scope. In other words, not all 114 controls are required in an information security policy for an organization to achieve ISO certification [18]. Thus, the first step of completeness checking is to determine the controls to check against and map which part of the policy corresponds to which control. The second step in the process is to assess the completeness of the controls by confirming the presence of key elements crucial to the controls.

Experts in the field commonly perform the second step, and the key elements are defined by the characteristics of a control. In a document complementing ISO 27001, referred to as ISO 27002, guidance for control implementation is provided for each control and can be used as grounds to identify the key elements [19]. Observe the activity diagrams in Figures 1.1 and 1.2 for illustrations of the process overview and an example of mapping and completeness checking of a control.
In the diagram given by Figure 1.1, a circle symbolizes the start and endpoint, a rectangle represents a process, and a rectangle (or rectangles) with a wavy bottom edge represents a document that serves as an input or output used by the processes. Meanwhile, in the activity diagrams given by Figure 1.2, a circle symbolizes the start and endpoint, a diamond coupled with a question represents a decision, a diamond without a question represents a merge, a green rectangle indicates a resulting successful action, and a red rectangle indicates a resulting unsuccessful action.

Figure 1.1: Activity diagram of a practical scenario.

Figure 1.2: Activity diagrams of a control mapping (left) and a completeness check (right). As an example, the control of "Policy on the use of cryptographic controls" from ISO 27001:2013 was used and its implementation guidance provided in ISO 27002:2013 was referenced when determining examples of characteristics for completeness.

In Figure 1.1, the practical scenario is given from start to finish in an activity diagram on a holistic level. It is divided into one input, an information security policy, leading to two processes and two outputs, one for each step. The output from the control mapping step is a collection of extracts representing the result of annotating the information security policy by controls. The left activity diagram in Figure 1.2 is an example of such a mapping, where a scope statement defines the need for the control [18]. Once the mapping has been established, the second step in the process is to check the extracted control's completeness by confirming the presence of key factors. An example of this is given in the right activity diagram in Figure 1.2. The output from the completeness checking is then a completeness assessment where each control is deemed either complete or incomplete.
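The two-step procedure above can be sketched in code. The following is a minimal, hypothetical Python sketch: the control ID, its keyword list, and the example policy extracts are invented for illustration and are not taken from ISO 27001:2013 or ISO 27002; a real implementation would use expert-derived key elements rather than keyword matching.

```python
# Hypothetical key elements for one in-scope control (invented for
# illustration; real key elements come from ISO 27002 guidance).
IN_SCOPE_CONTROLS = {
    "A.10.1.1": ["cryptographic", "encryption", "key management"],
}

def map_controls(extracts, controls):
    """Step 1 (control mapping): pair each policy text extract with
    the controls whose key elements it mentions."""
    mapping = {}
    for cid, keywords in controls.items():
        mapping[cid] = [e for e in extracts
                        if any(k in e.lower() for k in keywords)]
    return mapping

def check_completeness(mapping, controls):
    """Step 2 (completeness check): a control is 'complete' only if
    every key element appears somewhere in its mapped extracts."""
    verdicts = {}
    for cid, extracts in mapping.items():
        text = " ".join(extracts).lower()
        verdicts[cid] = all(k in text for k in controls[cid])
    return verdicts

# Invented policy extracts for the example run.
policy = ["Encryption with approved key management is mandatory.",
          "Cryptographic controls follow the corporate standard."]
mapping = map_controls(policy, IN_SCOPE_CONTROLS)
print(check_completeness(mapping, IN_SCOPE_CONTROLS))  # {'A.10.1.1': True}
```

The NLP models studied in this thesis replace the naive keyword test in step 2 with a learned classifier; the surrounding two-step structure stays the same.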
Of course, for an expert to manually annotate an information security policy that may have up to 114 controls present, and to check the completeness of each one, is not only time-consuming and labor-intensive but also error-prone. Furthermore, a non-expert who lacks the knowledge of what is required to be complete in relation to ISO for each control, for example, what a strong cryptographic algorithm is, would struggle to define the lowest levels of the information security policy. Hence, supplying both experts and organizations with an automated solution is desirable.

An automated solution could also be of interest to software engineers, not only because it can dictate what software tools and methods of working are available to them, but also because it promotes the practice of information-security-aware development. Therefore, product owners who might lack legal expertise and legal experts who may lack knowledge of the software domain should work in close collaboration to define a consensus on various elements for products and processes, which could in turn yield requirements for, for example, reliability and security.

1.2 Research questions

The following research questions are divided into two major themes, where one is aimed towards establishing a quality framework for evaluating information security policies in order to identify missing ISO compliance factors (RQ1), while the second focuses on data analysis, model analysis, model evaluation, and comparison (RQ2). Ultimately, the latter research question aims to support the first with a language model.

RQ1: What characterizes a good quality machine learning framework based on factors such as the amount of data and manual labor needed for information security policy?

RQ2: To which degree does a GPT-3 language model determine the ISO completeness of various organizations' information security documents?
RQ2.1: Which machine learning model features are beneficial for determining document coverage and alignment in relation to ISO?

RQ2.2: To what degree can the features of GPT-3 enhance the document classification process, and how does it perform versus other algorithms?

RQ2.3: How does the GPT-3 model trained and evaluated on information security policy documents compare to performances of GPT-3 models applied in other domains?

Addressing RQ1 leads to the design of a framework, which lays the foundation for the document classification in RQ2. RQ2, a broader question, is further broken down into three sub-research questions. RQ2.1 allows for experimentation and analysis of different machine learning algorithms (and their various properties) to establish an optimal benchmark against which future results can be compared. Answering RQ2.2 leads to comparisons between the GPT-3 model and the previously established benchmark, while also investigating which aspects of GPT-3 aid (or hinder) its classification ability versus standard algorithms. Finally, addressing RQ2.3 leads to comparisons of the achieved performance results of GPT-3 with applications within other domains, to deem whether information security policy completeness is a reasonable domain of application for GPT-3.

1.3 Delimitation

This thesis uses existing information security policies and already implemented natural language processing models and does not define nor create any new ones. Furthermore, the time restrictions of this thesis limit the size of the dataset of gathered information security policy documents; with more time, a more refined and versatile dataset could be created. Furthermore, this thesis does not deal with compliance but rather completeness, as the former requires more attention and involvement of ISO-compliance experts.
Finally, this thesis does not attempt a real-world application and evaluation of models but focuses on designing a possible solution to the problem context.

1.4 Report structure

The remainder of this paper is divided into six chapters, where Chapter 2 covers the existing research and Chapter 3 provides the theory behind the study. Chapter 4 explains the execution of the study, while Chapter 5 presents the results. Afterwards, Chapter 6 provides a discussion in relation to the results, threats to validity, and suggestions for future work. Finally, Chapter 7 provides explicit answers to the research questions together with final remarks.

2 Related Work

The area of text classification can roughly be divided into two categories [20]: rule-based methods and machine learning (ML) based methods. Rule-based methods require the researchers to have deep domain knowledge and use pre-defined rules to classify texts. In contrast, ML methods require models and pre-labeled data to learn the relations between the texts and their corresponding labels.

In this section, the area of machine learning methods in natural language processing is further studied to understand what methods researchers are currently using to achieve state-of-the-art performance in terms of preprocessing steps, models, resampling procedures, and evaluation metrics. Furthermore, a closer look at similar studies involving the classification of policies is provided, to better estimate what the field looks like and what approaches different researchers have taken to tackle the problem of policy classification. Lastly, it is vital to understand how large-scale models such as GPT-3 and BERT are used in current research, their limitations, and the steps needed to achieve good performance.
2.1 Text classification in Natural Language Processing

To better understand which ML models and preprocessing steps are commonly used in research concerning natural language processing, two main studies have been examined. First, Rahman et al. [21] used standard classification models, such as Support-Vector Machine (SVM) and Random Forest, to classify sentiments of tweets in two different datasets. As a pre-sampling procedure, the researchers used K-fold cross-validation, with k = 4, to divide the data into training and validation sets. The best performing model (MaxEnt) achieved an average F1-score of 76%. The F1-score is the harmonic mean of precision and recall, and thus measures how well a model performs while taking misclassifications into account. The authors of [22] used ML models to classify emails as either phishing emails or not. The models used were SVM, Naïve Bayes, Decision Tree, Long Short-Term Memory (LSTM), and Convolutional Neural Networks; the model with the best average accuracy was the one leveraging Convolutional Neural Networks. Furthermore, similar studies have been done by Miao et al. [23], who used ML models to classify Chinese newspapers. The researchers used several different models, but the conclusion was that a Support Vector Machine (SVM) with a TF-IDF vectorizer yielded the best F1-score of 95.7%. Dadgar et al. [24] conducted a similar study; however, they analyzed English newspapers instead. The evaluation was performed on two datasets, one from BBC, which contained five categories, and one from 20NewsGroup, which contained 20 categories. Their best-performing model was SVM with a TF-IDF vectorizer, which achieved an F1-score of 95.67% across the BBC and 20NewsGroup data. Furthermore, a similar study was made by Tzimourtas et al. [25], where SVM, Random Forest, and Naive Bayes were compared on the 20NewsGroup dataset. The best scoring model was SVM, with an accuracy of 95%.
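The setup that recurs throughout these studies, a TF-IDF vectorizer feeding an SVM, evaluated with k-fold cross-validation on the F1-score, can be illustrated with a short scikit-learn sketch. The toy corpus and labels below are invented for illustration and are not data from any of the cited papers.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

# Invented toy corpus: 1 = positive sentiment, 0 = negative.
texts = ["great product, loved it", "terrible support, very slow",
         "excellent quality, works well", "awful experience, would not buy",
         "really happy with this", "quite disappointing overall",
         "works wonderfully every time", "broke instantly, very poor"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# TF-IDF features feeding a linear SVM, as in the studies above.
model = make_pipeline(TfidfVectorizer(), LinearSVC())

# k = 4 folds (cf. Rahman et al.); F1-score computed per fold.
scores = cross_val_score(model, texts, labels, cv=4, scoring="f1")
print(round(scores.mean(), 3))
```

With only eight documents the cross-validated score is meaningless as a benchmark; the point is the pipeline shape, which scales unchanged to a real annotated corpus.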
Even though SVM is a well-performing classifier, other models can also be used for text classification. Kim [26] investigated if simple Convolutional Neural Networks (CNN) with a small number of hyperparameters could perform well on text classification tasks. The researcher tested four models on seven different datasets and managed to achieve better performance compared to other studies made at the time. Sharma and Moh [27] conducted a study that used classifiers such as SVM, Naive Bayes, and a dictionary-based classifier to predict the outcome of the Indian election by determining the sentiment of tweets related to the election. The best-performing model was SVM, with an accuracy of 78.4%.

2.2 Policy classification

Based on the studies included in this section, the most commonly classified policies were privacy policies. These studies nevertheless give a good overview of which approaches are common when classifying policies. Story et al. [28] conducted a study where privacy policies of mobile applications were analyzed to determine if the privacy policy covered the kind of data that the application was accessing and potentially sharing. They divided the privacy policy into categories, such as Email Address, which indicated whether the app collected the user's email address or not. For example, the text "We collect your email address" would be classified as True by the Email Address classifier, but as False by the GPS Location classifier. Furthermore, each classifier was split into classifying whether a first party or a third party accessed the data. Hence, both a GPS Location 1st Party classifier and a GPS Location 3rd Party classifier needed to be trained. The reason for dividing the classifiers in this manner was that, in privacy policies, data could be accessed by a first party but not a third party. To train their models, Story et al. used an annotated set of documents, also known as an annotated corpus.
Preprocessing steps such as stop-word removal, vectorization, and normalization of sentences were conducted in order to improve the models' performance. Along with vectorization, the authors also used a manually crafted vector of boolean values, which indicated the absence or presence of characteristic words. The model used in the research was SVC, and the result was a mean F1-score of 71% over the 26 objectives that were classified. The conclusions that the authors drew from the results were that compliance issues in mobile privacy policies were common and that their proposed model, along with a mobile application analysis, can improve privacy transparency. Furthermore, [17] conducted a study where the researchers analyzed privacy policies and their completeness towards GDPR compliance. The analysis identified the absence or presence of metadata types in a text. For example, one metadata type was processing purposes, which concerned the "purposes of the processing for which personal data is being collected." The researchers used three different approaches to classify which metadata types were present in a policy. These approaches were a machine learning approach using Support Vector Machines (SVM), a similarity-based classification using cosine similarity of sentence embeddings, and a keyword-based classification method that compared sentences to keywords related to a specific metadata type. The result was that their completeness-checking model, which used machine learning, had an F1-score of 91.47%. Compared to a keyword-based approach, this method improved the F1-score by 32.35%. Narksenee and Sripanidkulchai [29] conducted a similar study, using machine learning models to determine if an application's behavior complied with the application's privacy policy. Thotawaththa et al. [30] investigated how machine learning models such as BERT could be used to classify privacy policies. Additionally, Alabduljabbar et al.
[31] conducted a study with the goal of reducing the read-time of privacy policies from a user perspective. The researchers utilized machine learning and deep learning models to classify the content of the policies and reduce the number of paragraphs that the users needed to read. The models utilized preprocessing steps such as stop-word removal, lemmatization, stemming, TF-IDF, Doc2Vec, Universal Sentence Encoder, and WordPiece. An ensemble of six machine learning and deep learning models was used to classify the privacy policies. The result was an F1-score of 91% on their validation dataset, and after a user study, they concluded that the read-time was reduced by 39.14%. Liang and Ye [32] conducted a study that aimed to create a classification process using three-way decisions [33] for inclusive policies. Their proposed model used a two-stage process. First, an ensemble is trained that outputs a category together with a confidence value (probability score). The value is then compared against a threshold, and if the probability score is lower than the threshold, the same data is passed to a traditional machine learning model. With this setup, the researchers managed to achieve greater performance with their three-way decision model compared to ten other baseline models. The best-performing model used AdaCNN for the first stage and SVM for the second stage. These studies give insight into how previous work on policies has been conducted. All the studies divided the policies into categories and then created a classifier for each category. However, the researchers utilized different approaches. For example, Story et al. [28] used a dataset that was annotated by domain experts and used the categories in the dataset. Thotawaththa et al. [30] used a combination of domain expert insight and user perspective to choose categories.

2.3 GPT-3 & BERT

In this section, studies related to GPT-3 and BERT are presented; these two models are also further explained in Chapter 3.
GPT-3, which is a pre-trained deep learning model by OpenAI, has been used for text classification purposes, such as classifying emails [34], detecting hate speech, and classifying racist or sexist texts [35]. However, GPT-3 has its limitations. One study by Moradi et al. [6] investigated if GPT-3 could perform well on text classification tasks in the biomedical domain. The conclusion was that the model could not achieve state-of-the-art performance on the chosen NLP tasks when trained on just a few examples. BERT, which stands for Bidirectional Encoder Representations from Transformers, is another pre-trained machine learning model that has also been used in several studies. To understand how BERT can improve performance in binary classification tasks, Zhang and Zhang [36] conducted a study where BERT was used as an embedding layer for a downstream ML model. The researchers evaluated this model against benchmarks on the IMDB dataset. The result was that the model that used BERT as an embedding layer had an F1-score of 93.11%, an improvement of 2.01% compared to the best-performing baseline model.

3 Background

This section introduces the domain background, i.e., an overview of the information security policy and the ISO 27001:2013 standard. Afterward, the necessary background pertaining to the technical approach is summarized.

3.1 Domain background

Information security policies represent an organization's ability to safeguard information assets proactively. In other words, they are meant to exist as documentation of an organization's approach to managing information security. They have, therefore, also become acknowledged as an organization's most crucial information security mechanism [37]. Information security alone is about providing "...
protection of information and information systems from unauthorized access, use, disclosure, disruption, modification, or destruction in order to provide confidentiality, integrity, and availability" [38]. However, researchers argue that technical implementations alone are no longer adequate for protecting an organization's information assets and need to include more factors, such as the management and employees [8]. Thus, the information security policy, which provides "... directives, regulations, rules, and practices that prescribe how an organization manages, protects and distributes information" [39], has also become recognized as a crucial business document of any organization [8].

3.2 ISO 27001:2013

Implementing an information security policy alone is not sufficient to safeguard the organization's information assets. Although no perfect security and protection plan exists to this date, proper framework and technique implementation in the shape of various security standards helps minimize the risk of harmful exploitation and establishes the best practices for information security management within an organization [12][8]. ISO 27001 is an example of such a security standard for information management systems, and it provides rules and guidelines for organizations to follow in order to decrease the risk of information and information systems being exposed [11]. Increasing an organization's compliance with such standards, therefore, assists with establishing a robust Information Security Management System (ISMS) [8]. The standard, in its entirety, specifies 114 controls divided into 35 control objectives, which are further divided into 14 domains. A control is a type of safeguard; a control objective is a statement that defines the result of implementing said control or controls; and a domain is a grouping of control objectives that belong to a specific theme [11][40].
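To make the domain–objective–control hierarchy concrete, the sketch below models it as a nested Python structure, using the Human resource security domain (A7) as example data. The `implemented` set is invented for illustration, and the completeness check is a simplification of real certification logic, which also depends on the organization's chosen scope.

```python
# Hypothetical sketch of the ISO 27001 hierarchy described above:
# a domain groups control objectives, which group controls.
iso_domain = {
    "id": "A.7",
    "name": "Human resource security",
    "objectives": {
        "A.7.1": ["A.7.1.1", "A.7.1.2"],
        "A.7.2": ["A.7.2.1", "A.7.2.2", "A.7.2.3"],
        "A.7.3": ["A.7.3.1"],
    },
}

# Invented example: the set of controls an organization has implemented.
implemented = {"A.7.1.1", "A.7.1.2", "A.7.2.1", "A.7.2.2", "A.7.2.3", "A.7.3.1"}

# A domain is fulfilled once all controls of all its objectives are in place.
controls = [c for ctrls in iso_domain["objectives"].values() for c in ctrls]
domain_compliant = all(c in implemented for c in controls)
print(domain_compliant)  # → True
```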
The need for each control to be implemented is defined by an ISO 27001 scope set by an organization prior to applying for an ISO certification. Hence, only a subset of the controls may be required for ISO compliance, although most commonly all of them are in scope. Observe the extract from ISO 27001 in Figure 3.1 for an example of an entire domain.

Figure 3.1: ISO 27001:2013 extract of the domain Human resource security. The table is an extract from table A.1 in Annex A of SS-EN ISO/IEC 27001:2017 and is reproduced with due permission from SIS, the Swedish Institute for Standards, who holds the copyright and also sells the complete standard www.sis.se.

In Figure 3.1, the domain is defined as Human resource security (A7) and consists of three control objectives: A.7.1, A.7.2, and A.7.3. The control objectives pertain to the different possible statuses of employment. These control objectives then have six controls: A.7.1.1, A.7.1.2, A.7.2.1, A.7.2.2, A.7.2.3, and A.7.3.1. The implementation of these controls is what is required to fulfill the control objectives and ultimately also be compliant with the domain [18]. Furthermore, each control is supplied with implementation guidance in a document complementing ISO 27001, known as ISO 27002:2013. For an example of such guidance, observe the extract from ISO 27002 provided in Figure 3.2.

Figure 3.2: ISO 27002:2013 extract of the implementation guidance for the control A.7.3.1 defined in ISO 27001. The text is taken from SS-EN ISO/IEC 27002:2017 and is reproduced with due permission from SIS, the Swedish Institute for Standards, who holds the copyright and also sells the complete standard www.sis.se.

In Figure 3.2, the implementation guidance provides more information regarding what the implementation of the control related to termination or change of employment responsibilities should cover.
However, the guidance is not tailored to the control requirements of individual organizations, and implementing it may also not be sufficient to pass a certification [19].

3.3 Natural language processing

Natural Language Processing (NLP) is an area within AI concerned with using computational methods to process natural language data in order to enable the construction of machine learning models. More specifically, by utilizing various NLP tools, any set of human language documents can be processed and represented in numerical forms that can be used in conventional data analysis or machine learning techniques [41]. The relevant NLP tools and techniques used in this study are mainly related to text normalization and text vectorization, where text normalization defines a standardized word format and text vectorization maps the elements of a text to a numeric format.

3.3.1 Text normalization

Text normalization consists of a set of tasks pertaining to converting natural language text to a simplified and standard format that enables comparison with other normalized texts by eliminating various redundancies and anomalies through generalization. Among these tasks, a few are commonly found in most normalization processes and are also relevant to this study. These are mainly word tokenization and word format normalizing [42]. Word tokenization is the simplification part of text normalization and consists of segmenting (or tokenizing) words from a running text and adding these to a comprehensive set (or vocabulary). This is also known as a text parsing operation and acts as an entry point toward word format normalizing. Word format normalizing, in turn, comprises tasks that change the segmented words into a standard format defined by a chosen pipeline of tasks [42]. Case-folding, stop-word removal, and lemmatization are a few examples of what could be included in that pipeline. Refer to the list below for a description of each task.
Case-folding: Maps all letters to lower case such that, for example, Policy and policy are represented the same. Due to its simplicity, case-folding has been recognized as a common practice among practitioners, and it is supported by popular NLP libraries, packages, and word lists [43][44]. However, a disadvantage of case-folding is the inherent ambiguity it introduces [42]. For example, GloVe (a method for learning word embeddings) and glove (a clothing item) would be considered the same word.

Stop-word removal: Removes a class of words known as stop-words. Stop-words are words that are frequently present in any text, such as has and a; their presence is therefore trivial in most use cases. The removal can be done by using a predefined list of stop-words or by removing a top percentile of words in the vocabulary set [42].

Lemmatization: Maps all variations of a word to its corresponding root (or lemma) [42]. For example, has, had, have, and having are mapped to their shared lemma have, and recognized to its lemma recognize.

For a visual representation of each task, observe the example given in Figure 3.3.

Figure 3.3: Text normalization example of processing a sentence with each of the following NLP tasks: tokenization, case-folding, stop-word removal, and lemmatization.

In the figure, the first row of the first column demonstrates the tokenization process. In contrast, the second column represents the output of each previously mentioned task with the tokenized text as input. Finally, the second row of the first column provides an example of what the output of the text normalization process could look like if all the tasks were to be used in a pipeline.

3.3.2 Text vectorization

Text vectorization consists of a set of NLP tasks pertaining to mapping words or sentences from a text to a vector within a predefined vector space, also known as a vector space representation. This process is more commonly known as word embedding, or an embedding technique. The embedding may take on different representations depending on the vector space utilized and the corresponding embedding technique used to map to it [42]. Moreover, embedding techniques can be either context-insensitive or context-sensitive [45]. Context-insensitive embedding techniques, such as frequency-based and sequence-based representations, primarily deal with mapping single words to single vectors [45]: a frequency-based representation maps words based on their frequency, while a sequence-based representation focuses on sequences of words [42][46]. Context-sensitive embedding techniques, such as contextual representations, on the other hand, map multiple contexts of the same word to multiple vectors [45]. Although all embeddings use multidimensional vectors and can be coupled with machine learning algorithms, their success in various practical applications and the insights gained from these applications may differ [46]. For a simple example of what a basic frequency-based embedding could look like, observe Figure 3.4.

Figure 3.4: Text vectorization example using a frequency based approach.

In Figure 3.4, the box at the top represents the input to the vectorizer. The middle box is the vocabulary that stems from the input. Finally, the box at the bottom is the resulting vector from processing the input in relation to the vocabulary. This approach simply counts all the occurrences in the document and is therefore considered a frequency-based representation.

3.3.2.1 Frequency based representations

Frequency-based word embedding is the most commonly used vector space representation and uses an approach of counting word occurrences to construct sparse multidimensional numeric vectors.
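This counting approach, in the spirit of Figure 3.4, can be sketched in a few lines of Python; the input sentence below is invented for illustration.

```python
# Minimal sketch of frequency-based (bag-of-words) vectorization.
from collections import Counter

document = "the policy protects the information assets"
tokens = document.split()

# Vocabulary: the ordered set of distinct tokens in the input.
vocabulary = sorted(set(tokens))
counts = Counter(tokens)

# One frequency per vocabulary entry; word order and position are
# ignored, hence "bag of words".
vector = [counts[word] for word in vocabulary]
print(vocabulary)  # ['assets', 'information', 'policy', 'protects', 'the']
print(vector)      # [1, 1, 1, 1, 2]
```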
More specifically, a set of words is mapped to a matrix where each word corresponds to a column, and its frequency is contained in the rows [46]. The resulting frequency-based matrix is characterized as being high-dimensional and sparse. This is due to large vocabulary sizes that directly correspond to a large set of columns (i.e., words), where most of the rows (i.e., word frequencies) are zero, since each document only contains a small subset of the comprehensive vocabulary [46]. Furthermore, an important observation is that a frequency approach considers the order and position of words as irrelevant. Hence, this approach is also known as a bag-of-words approach [47]. An example of a word embedding technique that utilizes the bag-of-words approach is the Term Frequency-Inverse Document Frequency (TF-IDF) vectorizer. The TF-IDF vectorizer uses the product of two terms. The first term is the term frequency, i.e., the frequency of a word in a text, and the second term is the inverse document frequency, i.e., the inverse of the word's presence across all documents. Hence, the values are either zero or a positive real value. While the first term alone is sufficient in certain applications, the second term provides a normalization factor such that words that appear in few documents are given a higher weight [42]. However, using both terms may also give higher weight to errors and misspellings that were not captured during the preprocessing step. Therefore, the choice of whether to use a pure term frequency or TF-IDF can be application- and corpus-specific [46].

3.3.2.2 Sequence based representations

Sequence-based word embedding is built on the notion of a distributional structure suggested by Harris [48], which states that similar words tend to occur in similar contexts. Hence, this embedding class uses an approach of capturing useful syntactic and semantic properties in a given text in order to construct its vectors [42].
In contrast to the frequency-based representation, a sequence-based word embedding investigates the likelihood of a set of words ending up near one another by training a machine learning prediction model. The learned weights are then transferred to the word embedding matrix. As a result, unlike a frequency-based word embedding, which utilizes the entire vocabulary, the sequence-based one registers fewer properties. The resulting matrix from sequence-based word embedding is therefore characterized as dense and short, and it can contain any real value [42]. Two examples of prominent sequential word embedding techniques are Word2Vec and Global Vectors (GloVe). Although both use local context to capture various word semantics, i.e., semantics between words within a defined set size, GloVe takes it one step further by including global context. More specifically, GloVe also attempts to find semantic relationships between words on a corpus level by utilizing global corpus statistics such as word co-occurrence probability ratios [45]. However, both techniques share a shortcoming: poor predictions for words that have not previously been seen, i.e., out-of-sample predictions.

3.3.2.3 Contextual based representations

Contextual-based representations provide representations of words in context. Unlike frequency- and sequence-based representations, which utilize a single vector embedding per word, a contextual-based representation yields an entirely new vector every time a word is encountered in a new context. The vector is then a representation of a specific word type in a specific context. This embedding can be used to compare differences between two words in a context and determine their similarities [42].

3.3.3 Pre-trained models

In recent years, researchers have found that pre-trained NLP models often outperform models that have been built from scratch using task-specific corpora.
More specifically, these are pre-trained models that have learned relevant information from large sets of corpora prior to being applied to a new practical application [49]. This discovery has not only led to an increase in publicly available pre-trained word embedding models [50], but has, together with transformer models and contextual embeddings, also enabled a new functionality referred to as few-shot learning. Few-shot learning models require only a few samples to achieve good performance. Their emergence builds on predecessors that over the years have become more advanced, incorporating ever more parameters and improvements in NLP task performance. They are trained on enormous corpora so that they require little fine-tuning while still achieving state-of-the-art performance. Observe Figure 3.5 for an overview of the size difference between a few pre-trained language models. In the figure, the two models GPT-3 by OpenAI [1] and BERT by Google [5], both of interest to this study, are presented along with their sizes.

Figure 3.5: A graph displaying different NLP models along with the number of parameters for each model [1], [2], [3], [4], [5].

3.3.3.1 Transformers

One of the main enablers of large pre-trained models is the neural network architecture Transformer, as described by Vaswani et al. [51]. The purpose of the Transformer, as described by the authors, was to create an architecture that was less complex and more efficient compared to models utilizing recurrence and convolutions. Therefore, the Transformer utilizes the attention mechanism, which has the advantage of only requiring O(1) sequential operations, while a Recurrent Neural Network requires O(n) operations, where n is the sequence length. Furthermore, it uses a multi-headed self-attention operation over the input context tokens followed by position-wise feed-forward layers to produce an output distribution over the target tokens.
This has been shown to outperform other machine learning models on various tasks such as machine translation and language modeling. The self-attention layer in the Transformer architecture builds on the attention mechanism proposed by Bahdanau et al. [52]. The self-attention layer allows the model to simultaneously attend to different parts of the input sequence. It has been used in several works [53][54][55], but then often in combination with RNNs. Vaswani et al. proposed that RNNs are not needed and that the attention mechanism alone is enough. In the Transformer model, Vaswani et al. used Scaled Dot-Product Attention, which is defined in Equation 3.1. In the equation, Q, K, and V represent a Query, a Key, and a Value, each of which stems from the words of the input sentence, and dk denotes the dimension of the keys. Scaled dot-product attention is almost identical to normal dot-product attention apart from the scaling factor 1/√dk. Vaswani et al. motivated this factor by noting that for large values of dk, the dot product itself grows very large. The attention score in Equation 3.1 is calculated for each word in the input.

Attention(Q, K, V) = softmax(QK^T / √dk) V    (3.1)

A simplified overview of the Transformer architecture can be seen in Figure 3.6. The use of shifting the output, along with the masked multi-head attention layer, ensures that output prediction relies only on inputs that precede the output [51].

Figure 3.6: A simplified overview of the transformer's architecture.

3.3.3.2 GPT-3 by OpenAI

GPT, which stands for Generative Pre-trained Transformer, was first introduced by Radford et al. [56] in 2018. This research aimed to create a model that could achieve strong natural language understanding without the need for large changes when applying the model to different tasks, such as entailment tasks, similarity tasks, question answering tasks, and commonsense reasoning tasks.
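Returning to Equation 3.1, the scaled dot-product attention computation can be sketched in pure Python. The 2×2 matrices Q, K, and V below are invented toy values standing in for the learned projections of a real model.

```python
# Pure-Python sketch of the scaled dot-product attention in Equation 3.1.
import math

def matmul(A, B):
    """Plain matrix multiplication over nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    exps = [math.exp(x - max(row)) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    d_k = len(K[0])                      # dimension of the keys
    K_T = [list(col) for col in zip(*K)]  # transpose of K
    # Q K^T scaled by 1 / sqrt(d_k), as motivated by Vaswani et al.
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, K_T)]
    weights = [softmax(row) for row in scores]  # each row sums to 1
    return matmul(weights, V)            # weighted combination of values

# Invented toy matrices.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out = attention(Q, K, V)
print(out)
```

Each output row is a convex combination of the rows of V, with the weights given by the softmax of the scaled query-key scores.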
GPT uses a multi-layer Transformer decoder model [51] due to its excellent transfer performance on different tasks. The training phase of GPT consisted of two stages. First, it is trained on a large corpus of unlabeled data. Second, the model's parameters are adapted using discriminative fine-tuning. To accomplish the first stage, the researchers use standard language modeling to calculate the likelihood L, which depends on the conditional probability P. The conditional probability is modeled using a neural network whose parameters are trained using Stochastic Gradient Descent. For the second stage, the goal is to maximize the likelihood seen in Equation 3.2,

L2(C) = Σ(x,y) log P(y | x1, ..., xm)    (3.2)

where C is a labeled dataset, x denotes the input tokens, and y the labels. The model was trained on the BookCorpus dataset, containing 7000 unpublished books. Then, to benchmark the model, it was fine-tuned on another set of data depending on the task. This resulted in GPT achieving state-of-the-art performance on 9 out of 12 datasets. However, GPT was only the first iteration of this model. In 2019, Radford et al. [3] released a new study in which they had conducted further research to create a new model, called GPT-2. GPT-2 also uses the Transformer architecture, and its foundation is similar to the original GPT, with a few changes. Most notably, the largest version of GPT-2 contains 1.5 billion parameters and 48 layers. Furthermore, the vocabulary was expanded, the context size was increased, and a larger batch size was used. The model was trained on the WebText dataset, which the researchers created using web scraping techniques focused on retrieving only high-quality documents. The resulting dataset consisted of over 8 million documents [3]. GPT-2 achieved state-of-the-art performance on 7 out of 8 studied datasets in its zero-shot setting, meaning that it was not fine-tuned on any training data before evaluation.
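As a toy illustration of the objective in Equation 3.2, the snippet below sums the log-probabilities that a model assigns to the correct label over a small labeled set. The probability values are invented stand-ins, not real GPT outputs.

```python
# Toy illustration of the fine-tuning objective in Equation 3.2:
# the log-likelihood of the correct labels over a labeled dataset C.
import math

# Invented model outputs: P(y | x1, ..., xm) for the correct label y
# of each of three hypothetical (x, y) examples.
predicted_probs = [0.9, 0.8, 0.7]

# L2(C): the quantity maximized during discriminative fine-tuning.
log_likelihood = sum(math.log(p) for p in predicted_probs)
print(round(log_likelihood, 4))
```

Since each term is the log of a probability, the sum is at most zero, and pushing every correct-label probability toward 1 drives the objective toward its maximum.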
The most recent iteration of GPT, called GPT-3, was proposed by Brown et al. [1] in 2020. It continues the trend of expanding the number of parameters. In total, eight models were created, with the number of parameters ranging from 125 million to 175 billion [1]; the largest model is the one known as GPT-3. Compared to GPT-2, a few modifications were made, but it still follows the same architecture. It was trained on a larger dataset than GPT-2 and nearly matched the performance of fine-tuned models on benchmark datasets. The authors noted that this is a promising result, since GPT-3 only requires 10-100 examples in its few-shot setting to achieve good performance. This compares favorably to fine-tuned models, which can require labeled training datasets with hundreds of thousands of examples.

3.3.3.3 Using GPT-3

Unlike models such as BERT and those from scikit-learn, which can be downloaded to a computer and then trained, evaluated, and validated without an internet connection, GPT-3 cannot be used offline; its primary method of communication is an API. The API gives the user access to OpenAI's file uploading system, the use of their models, fine-tuning of models, and also the creation of embeddings. GPT-3 has a playground mode, which will largely not be used in this study since it is ineffective for classifying larger quantities of data. The playground mode presents a prompt to the user where it is possible to input tasks to GPT-3. GPT-3 can be asked to complete a sentence, classify an animal, or translate something from one language to another. An example of this can be seen in Figure 3.7. The "=>" sign seen in the figure is called the separator, and it tells GPT-3 where the task ends. This sign can be chosen arbitrarily as long as it is not present anywhere else in the prompt.
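The separator convention can be sketched as follows. The task text below is invented, and the commented-out request (using the 2022-era `openai` Python library) is shown for illustration only, since it requires an API key and a network connection.

```python
# Sketch of building a playground-style prompt with a separator.
task = "Is this line valid Python? x = [i for i in range(10)]"
separator = "=>"  # marks where the task ends; any token absent
                  # from the rest of the prompt would work

prompt = f"{task} {separator}"
print(prompt)

# Hypothetical completion request (not executed here):
# import openai
# response = openai.Completion.create(
#     engine="davinci", prompt=prompt, max_tokens=5
# )
```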
Still, it can be useful for demonstration purposes and for exploring how GPT-3 behaves when given certain tasks.

Figure 3.7: The GPT-3 playground where the user can input tasks. In this case GPT-3 was asked to validate if a given code line is valid in the programming language Python. The green text is GPT-3's response.

Furthermore, OpenAI provides an API for communicating with GPT-3. The API can be accessed through OpenAI's JavaScript library, Python library, and cURL commands. These methods make it possible to input classification tasks using code rather than the playground prompt. This study mainly uses the Python library rather than the playground prompt, as it enables multiple requests without pasting the tasks into the playground prompt.

3.3.3.4 Using GPT-3 as a classifier

GPT-3 can be used as a classifier for simpler tasks in its text completion mode. In Figure 3.8, GPT-3 is used to classify whether a language is an object-oriented language or a functional language. This classification works well, and if the use case were to use GPT-3 as a language classifier, it would suffice. However, when tasking GPT-3 with more complicated domains, such as information security policies, its text completion mode is insufficient. This is displayed in Figure 3.9, where it determines that Encryption is not part of ISO 27001:2013, even though it is. For this reason, this study uses GPT-3 in its classifier setting instead.

Figure 3.8: The playground prompt where GPT-3 is used as a classifier in its text completion mode. The green text is GPT-3's response.

Figure 3.9: The playground prompt where GPT-3 is used as a classifier in its text completion mode. The green text is GPT-3's response. Here GPT-3 fails to classify encryption as a part of the ISO 27001:2013 standard.

Example of using GPT-3 as a classifier

GPT-3, in its few-shot setting, works somewhat differently than in its text completion mode (see https://beta.openai.com/docs/guides/classifications). The whole procedure can be seen in Figure 3.10. The main difference to text completion is that examples of the data need to be uploaded to OpenAI; GPT-3 then uses the most relevant example data to classify the input. Using GPT-3 as a classifier can be divided into the following steps:

1. Format the data in JSONL format (https://jsonlines.org/).
2. Upload the data to OpenAI; the API responds with a file id corresponding to the uploaded data, which needs to be saved for later use.
3. Use either cURL commands, the Python library, or the JavaScript library to send a classification request to GPT-3.

Figure 3.10: The four step procedure taken by GPT-3 when used in its few-shot setting. Source of the image is https://beta.openai.com/docs/guides/classifications

To improve the classification shown in Figure 3.9, these steps can be applied as follows:

1. First, some example data needs to be provided, which is created manually for this example. The input to GPT-3 is of the format "Is [X] part of ISO 27001:2013?". The sample data can be seen in Table 3.1 and needs to be formatted into JSONL before being submitted to OpenAI.
2. Using Python with the openai library, the data is submitted to OpenAI:
   openai.File.create(file=open("example_data.jsonl"), purpose="classifications")
3. Then, the following code submits the query to GPT-3 for classification:
   model = openai.Classification.create(
       file=fileid,
       query="Is Cryptography part of ISO 27001:2013?",
       search_model="ada",
       model="curie",
       max_examples=5
   )
4. The response from GPT-3 can be seen in Figure 3.11. GPT-3 can now predict that Cryptography is part of ISO 27001:2013. The score indicates how useful GPT-3 deems an example to be. All examples seem to have a rather high score, indicating that GPT-3 is somewhat unsure of how to use these examples. However, this small example still gives insight into how GPT-3's classification works.
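The first of these steps can be sketched as follows: converting the Table 3.1 examples into JSONL, one JSON object per line. The "text"/"label" field names are assumed from OpenAI's classifications guide for the file format used in this study.

```python
# Sketch of formatting the Table 3.1 examples as JSONL for upload.
import json

examples = [
    ("Is User Access Management part of ISO 27001:2013?###", "Yes"),
    ("Is Information Classification part of ISO 27001:2013?###", "Yes"),
    ("Is geography part of ISO 27001:2013?###", "No"),
    ("Is handling of squirrels part of ISO 27001:2013?###", "No"),
]

# One JSON object per line; field names assumed per the guide.
lines = [json.dumps({"text": t, "label": l}) for t, l in examples]
jsonl = "\n".join(lines)
print(jsonl.splitlines()[0])

# The resulting string would then be written to example_data.jsonl
# and uploaded in the second step:
# with open("example_data.jsonl", "w") as f:
#     f.write(jsonl)
```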
Text | Label
Is User Access Management part of ISO 27001:2013?### | Yes
Is Information Classification part of ISO 27001:2013?### | Yes
Is geography part of ISO 27001:2013?### | No
Is handling of squirrels part of ISO 27001:2013?### | No

Table 3.1: The data used in the example.

3.3.3.5 BERT by Google

BERT, which stands for Bidirectional Encoder Representations from Transformers, is another pre-trained model similar to GPT-3 but with a different architecture. BERT was originally proposed by Devlin et al. [5]. The model is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. BERT is built on top of the Transformer, an attention-based model that learns contextual relations between words in a text.

Figure 3.11: Response from GPT-3 when sending the query "Is Cryptography part of ISO 27001:2013?", obtained using the OpenAI Python library. GPT-3's answer can be seen at line 3, where it labels the input as Yes. The selected_examples column shows how useful the uploaded examples were in classifying the input; a higher score means more useful.

The Transformer was initially designed for machine translation, but BERT is adapted for natural language understanding tasks such as sentence classification, question answering, and next-sentence prediction. BERT represents words in a text as vectors, or "embeddings". These embeddings are learned jointly with a multi-layer bidirectional Transformer encoder. The encoder reads the entire text input at once and learns to predict words that have been masked (replaced with [MASK]) or randomly replaced. The training process of BERT is self-supervised, meaning that it does not require labeled data. This makes it more efficient and scalable than previous models trained on labeled data. BERT is effective because it can capture the context of a word in a sentence, rather than just the word itself.
This is due to the bidirectional nature of the Transformer encoder.

3.4 Support Vector Machine (SVM)

A support vector machine (SVM) is a kernel-based linear classifier [57]. SVMs work in any number of dimensions [58], but for simplicity's sake, this section only considers the linear case. SVMs are based on the concept of dividing the data into classes using the best-fitted decision boundary [57], as can be observed in Figure 3.12. In higher dimensions, the decision boundary is referred to as a hyperplane; for example, if the data were in three dimensions, the decision boundary would be a plane. The decision boundary is given by Equation 3.3, where w is the weight vector, b is the bias, x is the data point, and +1 and -1 are used to label the two classes.

w^T x + b >= 0  ->  class +1
w^T x + b < 0   ->  class -1    (3.3)

However, this form of classification can lead to misclassification if points lie close to the boundary line, since small changes to x may flip the predicted class [57]. Therefore, to make the model more robust, a parameter ϵ can be added, which denotes how far the closest data point must be separated from the decision boundary. This is also referred to as a margin, and the SVM model attempts to fit a hyperplane that maximizes it. The reasoning is that the larger the margin, the larger the distance between the classes, and therefore the easier it is to differentiate between them. The decision boundary can then be written as Equation 3.4 and can also be observed in Figure 3.12.

w^T x + b >= ϵ/2   ->  class +1
w^T x + b < -ϵ/2   ->  class -1    (3.4)

Figure 3.12: The SVM classifier using a boundary line to separate and classify the data points.

4 Research Design

This study followed the research method Design Science Methodology (DSM) as described by Wieringa [59]. The purpose of DSM is to identify a problem, suggest an artifact that may improve the problem context, and validate whether the artifact operates as intended.
This process is defined as a design cycle and is performed iteratively. The design cycle is part of a larger problem-solving cycle defined as an engineering cycle. Wieringa defines the engineering cycle with the following five tasks:

1. Problem investigation
2. Treatment design
3. Treatment validation
4. Treatment implementation
5. Implementation evaluation

The design cycle consists of tasks one through three of the engineering cycle and may be performed numerous times before attempting a real-world application of the treatment, i.e., tasks four and five. For a visual representation of the engineering and design cycles, refer to Figure 4.1.

Figure 4.1: The engineering and design cycle as defined by Wieringa.

The methodology behind this study had a design cycle focus, and any real-world treatment implementation and evaluation are left as possibilities for future work. Hence, this section also follows the design cycle structure by outlining the problem investigation and the methodology used to design and validate an artifact. Although the design cycle is an iterative process, this section discusses the tasks in a linear fashion.

In the problem investigation section, the problem context is introduced as provided by the case company. The problem context is then further investigated for the purpose of identifying and defining the problem. After the investigation, an artifact that interacts with the defined problem is suggested in the treatment design stage. Finally, in treatment validation, the validation methods used for the suggested artifact are presented. Furthermore, a discussion of how the research questions are intended to be answered is provided in this section.

4.1 Problem Investigation

The case company, which specializes in software solutions and policy compliance, defined their problem context as information security policy content review in relation to ISO and expressed a desire for improvement.
Hence, the problem context was investigated together with the experts at the case company by conducting continuous knowledge transfer sessions in the form of meetings, where frequent discussions regarding their difficulties and work processes were held. These sessions were performed in order to add more background to the problem context and the current solution. For more details on these sessions, refer to Section 4.4.

Among the identified difficulties was the time-consuming, labor-intensive, and error-prone process of manually checking the completeness of each ISO control (as described in Section 1.1). Furthermore, the accepted level of information security defined by ISO is rather vague and may be formulated differently depending on the characteristics of an organization, despite using the same standard. Automating the completeness checking process by building machine learning models trained on these different variations can save a lot of time for the expert performing the information security policy review in relation to ISO. The saved time can instead be allocated to controls that either fail the completeness check or are more difficult to classify.

Hence, the problem investigated in this study was to determine the completeness of information security policies in relation to the ISO 27001:2013 standard utilizing different classical and modern NLP and machine learning methods. The main stakeholders affected by the study were any organization concerned with information security, policy experts, and researchers from, but not limited to, the disciplines of software engineering and applied artificial intelligence. The project has the possibility of providing organizations with an affordable method to achieve an acceptable level of information security and/or detect shortcomings in an information security policy in relation to ISO.
To establish a solution for this problem context, information security policies needed to be gathered in order to evaluate a machine learning model and framework that ultimately became the design artifact of this study.

4.2 Treatment Design

This section provides details on the approach used by the study for designing an artifact. More specifically, descriptions of the understanding and preparation of data, the creation of baseline and benchmark models, and the GPT-3 models are provided here. Furthermore, a discussion of what a good-quality machine learning framework could be, i.e., RQ1, is also provided.

4.2.1 Data understanding

The ISO 27001:2013 standard consists of 114 controls belonging to 35 control objectives, which in turn belong to 14 domains. In other words, a domain is at the highest level, a control objective at the middle level, and a control at the lowest level. Hence, the possible levels of data granularity to use for model learning are these three, and a large portion of all the domains should be present in any ISO certified information security policy, as established in Section 3.2.

However, the structure of information security policies, regardless of ISO certification, can look quite different from one policy to another. The differences are not only due to the lack of a normalized structure, but also to organizations' unwillingness to share their policy at the lowest level, as information security research has been described as "one of the most intrusive types of organization research" [60]. Hence, the lowest level of an organization's information security policy may often be hidden, its higher levels may differ too much from other policies, and/or the information may be too spread out across the policy to define where it is covered. This ultimately led to a major realization and decision for the course of this study.
The realization is that it may be impossible to measure the completeness of an information security policy in relation to all 14 domains, 35 control objectives, or 114 controls, mainly due to the reasons mentioned above, but also due to time limitations. Therefore, the focus of this study shifted to establishing a proof-of-concept of measuring completeness in relation to a few controls, control objectives, or domains, rather than the entirety of the ISO standard. The collection of controls was carefully selected together with an expert who had insight into what content was most often present in any information security policy without having access to the lowest levels.

4.2.2 Data collection and overview

The data collection was done in iterations. It started out with finding information security policies that were publicly available on organization websites and later also moved to contacting organizations through customer support or directly through phone or email. When searching for policies on the internet, search strings such as information security, information security policy, isp, public information security policy, and security policy were used; see Appendix A.1 for the full list. When contacting customer support through email, a pre-written email was sent, which disclosed who the authors were, what the purpose of the data collection was, and that anonymity would be guaranteed if they decided to share their policy; see Appendix A.2 for the template. When choosing companies to contact, a randomly selected subset of 30 was used from the list of the 100 largest companies in Sweden by turnover1. The full list of contacted companies can be observed in Figure 4.2.

Figure 4.2: The full list of contacted companies. In the Name column, the contacted company's name is listed. In the Response column, the results from the exchanges are listed.
The cells were marked green if a response was received in which the company disclosed that they would, or would not, share their policy, or if they redirected us. Information security policies were received from two companies; their names were therefore redacted to [Company].

The industries and geographical locations of the organizations were diversified, with some having more exposure than others. For example, the leading industry in the gathered dataset was academics, and the leading country was the UK. The structure of the documents varied greatly from policy to policy. The assumptions made in the previous section were quickly found to be true, as the policy content could, for example, be in bullet points, clearly marked sections, or plain text. A few policies also referred to internal documents to which access was only allowed for authorized personnel.

The biggest difficulty faced with data gathering was the sparse publicly available data that could be collected, but also confirming ISO certifications. The latter was done through third-party websites that keep records of ISO 27001:2013 certified companies and through extensive searching on company websites. In total, the dataset that was later annotated contained 49 information security policies. The distribution of the dataset with respect to labels was fairly even and can be observed in Figure 4.3.

Figure 4.3: A bar plot of the dataset containing ISO and non-ISO data points.

1https://www.largestcompanies.com/toplists/sweden/largest-companies-by-turnover

4.2.2.1 Data annotation

After a data gathering iteration had been performed, the data was extracted and annotated by the authors according to a set of controls, and was also labeled according to whether it was ISO certified or not. The set of controls was selected together with a policy expert such that the annotations could be carried out by non-experts, since annotation is a time-consuming task.
Hence, the data was annotated by two non-experts and was annotated twice per selected set of controls in order to avoid conflicting perspectives. Observe Figure 4.4 for the template data sheet that was used for data annotations in this study.

Figure 4.4: The template used for data annotations, including an example for the sake of illustration.

In Figure 4.4, the first column corresponds to the file name in the dataset and exists to avoid confusion with un-annotated files. The first column was also immediately dropped before any data processing began. The second column corresponds to the ISO label of the information security policy. The third column represents whether the control was present in the policy or not. If it was not found and the policy was labeled as non-ISO, a random number generator was used to select random text from somewhere in the policy. The reasoning behind this was to fill out the dataset with data that the model could still learn from. Conversely, if the control was not found and the policy was labeled as ISO, that extract was left empty to avoid biasing the model. The fourth and fifth columns were used as a method for reducing bias in the annotation process. The first annotator extracted the text and labeled the data point, then put a "1" in the Annotation_1 column to indicate that the data point had been annotated once. The second annotator then reviewed this annotation; if the second annotator agreed, the data point was considered complete. However, if the annotators disagreed, the data point was discussed until a consensus was reached, or it was brought up with a policy expert from the case company (see Section 4.4). Finally, the sixth column is the extract from the policy that was later used by the models.

4.2.3 Modeling

In order to answer RQ1, a machine learning framework needed to be established, and the decision was to be made based on the factors of data and manual labor needed at each data granularity level.
The data granularity levels are defined by the level of detail in the texts [61]. For example, a domain-level statement is very high-level since it contains the least amount of detail. The amount of data needed to train a successful model is determined by an estimate of the number of controls covered and their inherent complexity. To illustrate, a domain, which contains the most controls, each with their own complexity, is estimated to need more data than a single control. In other words, the less detailed the chosen level of data granularity, the more data is needed to cover the additional controls included. This argument is founded on the fact that the more controls are included, the more variance and variables have to be taken into account, and thus the higher the data demand. On the other hand, the more detailed the chosen level of data granularity, the harder it is to identify and annotate without the help of experts.

After many discussions, a consensus was reached together with the experts that the finest level of data granularity, i.e., the control level, was the preferred level of granularity. This choice was made on the basis that a control-level model brought the most benefit for the experts, as that was the level at which completeness checking was most often performed. Additionally, this put less pressure on the already scarce dataset. To avoid difficulties with annotations, the controls were selected such that non-experts could also identify them with a few guidelines from the experts.

The final architecture of the framework is presented in Figure 4.5 and remained to be evaluated to determine whether the estimated factors were sensible and whether it performed better than baseline models.

Figure 4.5: Framework architecture for information security policy control classification in relation to ISO.

In Figure 4.5, the input to the framework is defined in the box to the left.
It consists of a user-supplied selection of a control and an extract corresponding to that control from the information security policy in question. The box in the top right represents the machine learning pipeline and consists of an NLP pipeline combined with a binary machine learning classifier. Finally, the output in the box to the bottom right is the result from the classifier: a one if the control has ISO-level content and a zero if it does not.

4.2.3.1 Control selection

The controls were selected together with a policy expert under the additional assumption that each control was present in most public information security policies. Furthermore, the selected controls also varied in the factors that needed to be completeness checked against. In other words, this meant that the length, wordiness, and ways of formulating the extracts varied between the controls. The purpose of this was to measure the model's success even with controls that are more difficult to check completeness against. The selected controls are listed in Table 4.1.

4.2.3.2 Data preparation and models

The data was prepared before being processed by the machine learning models by first passing it through an NLP pipeline. While there were many different algorithms and packages to utilize for each task in the pipeline, the overall process is as depicted in Figure 4.6. In Figure 4.6, documents (or control extracts from information security policies) are input into a machine learning pipeline. The pipeline itself consists of two sections: the first is the NLP pipeline, and the second combines it with a machine learning classifier. In combination, an output of a classification label should be
Control id | Control name | Control definition
A.5.1.2 | Review of the policies for information security | The policies for information security shall be reviewed at planned intervals or if significant changes occur to ensure their continuing suitability, adequacy, and effectiveness.
A.7.2.2 | Information security awareness, education and training | All employees of the organization and, where relevant, contractors shall receive appropriate awareness education and training and regular updates in organizational policies and procedures, as relevant for their job function.
A.9.2.3 | Management of privileged access rights | The allocation and use of privileged access rights shall be restricted and controlled.

Table 4.1: Table of selected controls. The table is an extract from Table A.1 in Annex A of SS-EN ISO/IEC 27001:2017 and is reproduced with due permission from SIS, the Swedish Institute for Standards, which holds the copyright and also sells the complete standard (www.sis.se).

expected. The NLP pipeline comprises three tasks: text parsing, text normalization, and text vectorization. Meanwhile, the machine learning classifier consists of a single binary classifier. For a more detailed view of each task within the machine learning pipeline, observe Figure 4.7.

In Figure 4.7, text parsing is defined as tokenizing the text and acts as input to the text normalization task, which has three sub-tasks: case-folding, stop-word removal, and lemmatization. For case-folding, the Python standard library string operations were used. For stop-word removal and lemmatization, a pre-defined list of stop-words and a lemmatizer known as the WordNetLemmatizer were used, both of which came from the NLTK library. The following task, text vectorization, consisted of the three different types of word embedding models defined in Section 3.3.2. The TF-IDF vectorizer from the scikit-learn library was selected to represent the frequency-based model.
For the sequence-based models, Word2Vec and GloVe models from the Gensim library were selected. Finally, GPT-3 from OpenAI and BERT from the TensorFlow library were selected for the contextual models. The output from the NLP pipeline was then fed as input into the binary machine learning classifier, which was chosen to be the LinearSVC model from the scikit-learn library. The LinearSVC model is essentially an SVM that uses a linear separator, as described in Section 3.4. GPT-3 and BERT, on the other hand, did not need a stand-alone binary classifier, as these models handle the classification internally.

Figure 4.6: The overall machine learning pipeline. Documents enter the machine learning pipeline, which consists of an NLP pipeline and a machine learning classifier, and an output is produced.

Figure 4.7: Detailed view of the machine learning pipeline in Figure 4.6, including sub-tasks and the models used.

The main reasoning behind the selected combination of sub-tasks used for text normalization was to mimic the preprocessing of other pre-trained word embedding models [43] and maximize the gain from their usage by aligning the tokenized words. These pre-trained word embedding models were mainly related to the sequential embeddings of Word2Vec and GloVe, of which Gensim offered a wide variety. For this study, the word2vec-google-news-300 and glove-wiki-gigaword-300 pre-trained embeddings were selected [50]. The word2vec-google-news-300 was a collection of 300-dimensional pre-trained word vectors and used Word2Vec as a basis for its learning technique. The model had been extensively trained on text corpora from Google News and had a total of three million vectors. Similarly, the glove-wiki-gigaword-300 was also a 300-dimensional pre-trained collection of word vectors but used GloVe as a basis for the learning technique.
The model had been extensively trained on text corpora from Wikipedia, an online encyclopedia, and English Gigaword, an archive of newswire text data. Finally, glove-wiki-gigaword-300 had a total of 400 000 vectors [50]. These two pre-trained models are henceforth referred to as pre-trained Word2Vec and pre-trained GloVe.

4.2.4 Using GPT-3

In this section, the methodology for using GPT-3 is presented. Firstly, the method for the few-shot mode is shown, which includes how the data was uploaded to the model and the different parameters. Secondly, the method for the fine-tuned model is presented.

4.2.4.1 Uploading data

For both the fine-tuning mode of GPT-3 and the few-shot mode, the data that the models used needed to be uploaded to OpenAI. At the time, OpenAI only supported JSONL2 files, which are similar to regular JSON files except that each line in the file is a valid JSON value. These were structured in the following way:

{"text":"Text data of ISO certified policy + \n\n===\n\n", "label":"iso"}
{"text":"Text data of a policy that is not certified + \n\n===\n\n", "label":"oth"}

where \n\n===\n\n was used as the separator.

4.2.4.2 Few-shot mode

The data was uploaded to OpenAI for classification. A call to time.sleep(20) was used since it took some time for OpenAI to upload and process the text.

def _upload_training_file(self, path):
    response = openai.File.create(file=open(path), purpose="classifications")
    time.sleep(20)
    self.training_file_id = response.id

After the data was uploaded, a model was created; the hyperparameters for the model can be seen in Table 4.2. The hyperparameter File id corresponded to the uploaded data and changed when using K-fold cross-validation, since one id was needed for each fold. Prompt was the input data to be classified.
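As a minimal sketch of the JSONL structure described in Section 4.2.4.1 (the policy texts are placeholders, not data from the study), each record can be built as follows:

```python
import json

SEPARATOR = "\n\n===\n\n"  # separator appended to every text, as described above

def make_record(policy_text, iso_certified):
    """Build one JSONL record with the 'iso'/'oth' labeling scheme."""
    return {
        "text": policy_text + SEPARATOR,
        "label": "iso" if iso_certified else "oth",
    }

records = [
    make_record("Text data of ISO certified policy", True),
    make_record("Text data of a policy that is not certified", False),
]

# One valid JSON value per line, ready for upload via openai.File.create().
jsonl = "\n".join(json.dumps(r) for r in records)
```

Because json.dumps escapes embedded newlines, each record stays on a single line, keeping the file valid JSONL.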
Setting | Value
Model name | curie
Prompt | The text to be classified
Search model | ada
Logprobs | 2
Max examples | 3
Expand | Completion
Training file id | The id of the uploaded files

Table 4.2: The hyperparameters of GPT-3, in its few-shot setting, when classifying information security policies. In this table, logprobs = 2 indicates that the model returns the logarithmic probability values.

2https://jsonlines.org/

4.2.4.3 Fine-tuning mode

GPT-3's fine-tuning mode worked similarly to the few-shot mode. The main difference was that the fine-tuning mode was not yet supported by the OpenAI Python library, and therefore the OpenAI CLI and cURL commands needed to be used instead. The process can be described as follows.

1. Prepare the data: The OpenAI CLI provided commands that prepared the training data. The command verified that the training data had a separator at the end of each prompt, that the label started with a whitespace, and that the JSONL file was correctly formatted.

2. Create the fine-tuned model: Using the OpenAI CLI, a model was created by executing the command

openai api fine_tunes.create -t <TRAIN_FILE> -m <BASE_MODEL>

OpenAI responded with a model name, which was saved so that the model could be reused unless it needed to be re-trained.

3. Use the fine-tuned model: The fine-tuned model was used through the following OpenAI CLI command

openai api completions.create -m <FINE_TUNED_MODEL> -p <PROMPT>

where FINE_TUNED_MODEL was the model name retrieved from the previous step, and PROMPT was a selected text from the testing dataset. To evaluate the model, this command needed to be executed for each value in the testing dataset. The returned value was then compared to the true value to compute the F1-score.

4.3 Treatment Validation

In this section, the metrics that were used to evaluate the models are presented, and the methods of validation are described.
The model validation in this study was performed in three steps in order to answer RQ2 and a few of its facets. This section follows the same structure as those steps, except that the metrics and tools used for comparison are explained first.

The first step in the model validation process was to compare models and word embedding techniques in order to investigate which model features were best suited for each dataset. To perform this step, a collection of models was trained on a training set and then evaluated on a testing set. The results were then compared between the models in terms of F1-score and accuracy, and analyzed in order to provide an answer for RQ2.1 and partially for RQ2.2.

The second step was to validate the performance of the framework in combination with GPT-3 by comparing its results to a validation dataset that had been annotated by a policy expert. This step was mainly performed to evaluate whether the framework and model functioned as intended, i.e., whether it could aid or replace an expert in determining completeness. The result from this step complemented the previous results, and together they provided a complete answer for RQ2.2.

The third, and final, step was to investigate the use of GPT-3 in other domains, such as biomedicine, and compare it to the performance achieved in this study in order to determine whether GPT-3 was a good fit for information security classification. The results from the other domains were obtained from various research papers that have used GPT-3 for text classification. The result from this step provided an answer to RQ2.3 and thus also allowed for a discussion of RQ2.

4.3.1 Metrics and tools

The metrics and tools used to evaluate and validate the various models are defined in this section. Besides checking the accuracy of a model on a testing set, the metrics of precision, recall, and F1-score were used along with the method of K-fold cross-validation.
4.3.1.1 Precision, Recall and F1-score

When dealing with imbalanced datasets, evaluation metrics such as precision and recall are preferred [62]. The reason is that accuracy alone can be misleading: a model that labels all texts the same can still achieve high accuracy on an imbalanced dataset. Therefore, the evaluation metrics chosen in this paper were precision, recall, and F1-score. These are based on the confusion matrix seen in Table 4.3 and are defined in Equations 4.1, 4.2, and 4.3.

 | Predicted Positives | Predicted Negatives
Real Positives | TP | FN
Real Negatives | FP | TN

Table 4.3: Confusion matrix, where TP = True Positives, FN = False Negatives, FP = False Positives, TN = True Negatives.

precision = TP / (TP + FP)    (4.1)

recall = TP / (TP + FN)    (4.2)

F1-score = 2 · precision · recall / (precision + recall)    (4.3)

4.3.1.2 K-fold cross-validation

K-fold cross-validation is a method for validating a model by partitioning the dataset into k subsets, training the model on k − 1 subsets, and testing the model on the remaining subset [63]. This is repeated until every subset has been used as the testing set. K-fold cross-validation was used across all datasets, with 75% of the total dataset partitioned as a training set and the remaining 25% as a testing set. The final result was calculated as the mean of the results across the k subsets.

4.3.2 Model comparisons

The model comparisons were performed by comparing GPT-3 to various benchmark models in order to validate its performance and answer the question of which characteristics were desired for the classification of the datasets. These benchmark models are defined in Table 4.4.
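To make Equations 4.1 to 4.3 and the fold partitioning concrete, the following is a small plain-Python sketch with synthetic labels (illustrative only, not data from the study):

```python
def confusion(y_true, y_pred):
    """Counts from the confusion matrix in Table 4.3."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def f1_score(y_true, y_pred):
    """F1-score as in Equation 4.3, via precision (4.1) and recall (4.2)."""
    tp, fp, fn, _ = confusion(y_true, y_pred)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def kfold_splits(n, k):
    """Partition indices 0..n-1 into k disjoint test folds (Section 4.3.1.2).

    With k = 4, each test fold holds 25% of the data and the remaining
    75% is used for training, matching the split used in this study.
    """
    return [list(range(i, n, k)) for i in range(k)]

# Synthetic example: TP = 3, FP = 1, FN = 1, so precision = recall = 0.75.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
score = f1_score(y_true, y_pred)  # -> 0.75
```

In practice the study computed these metrics per fold and averaged them, as described in Section 4.3.1.2.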
Model name | Word embedding type | Word embedding model | Classifier
ZeroR | - | - | Zero Rule Classifier
TF-IDF | Frequency | TfidfVectorizer (scikit-learn) | LinearSVC
Word2Vec | Sequential | Word2Vec (Gensim) | LinearSVC
Word2Vec (pre-trained) | Sequential | word2vec-google-news-300 (Gensim) | LinearSVC
GloVe (pre-trained) | Sequential | glove-wiki-gigaword-300 (Gensim) | LinearSVC
BERT | Contextual | bert_en_uncased (TensorFlow) | BERT

Table 4.4: Table of benchmark models with their word embedding types and models, as well as the classifier they are combined with.

In Table 4.4, the model names are given along with their word embedding type, word embedding model, and the classifier they are coupled with. The same model names are referenced in the upcoming sections. An additional model that has previously been left unnamed is also present in the table: the ZeroR model, a Zero Rule Classifier. A ZeroR classifier is a model that assigns all data points to the most frequent class and is a common baseline to validate performance against. It is also a method for determining whether the model in question is a useful predictor. To illustrate, a ZeroR model always predicts the majority class of any given dataset, ignoring all possible predictors [64]. In the case of this study, the classifier predicted all data points as "Non-ISO", as the dataset leaned slightly more towards Non-ISO labeled data points than ISO.

The validation of GPT-3, on the other hand, consisted of evaluating the two models, few-shot and fine-tuned, on the annotated dataset. For both models, there was one common parameter, the Model parameter, whose values were Davinci, Curie, Babbage, and Ada. The differences between these models lay in which tasks they performed well on, how costly they were to use, and how efficient they were when it came to training.
Overall, Davinci was the best performing model, but it was more costly and slower to train. Ada, on the other hand, was the simplest model and therefore cheaper to use and faster to train than Davinci. In this study, for GPT-3 few-shot, Davinci and Curie were used as classifier models, and Ada and Curie were used as search models, while for GPT-3 fine-tuned, Davinci and Ada were used as classifier models.

For the few-shot model, three parameters were changed during the validation process: max_examples, which determined how many examples GPT-3 was given at inference time; K, which determined how many subsets the dataset was divided into (see Section 4.3.1.2); and lastly the model name. For the fine-tuned model, only the model was changed.

4.3.3 Expert validation with the case company

The expert validation was carried out with policy experts from the case company. This was done to determine how GPT-3's predictions compared to those of experts in information security policies. The process consisted of creating a new dataset for each control, referred to as a validation dataset. Each of these datasets consisted of ten extracts from information security policies corresponding to the selected control, which no model had previously seen or been evaluated on. The expert validation session was carried out in the form of a demo session where an expert was quickly briefed on the con