AI-Wolf in Sheep's Clothing
Distinguishing between Swedish humans and AI wannabes
Master's thesis in Computer Science and Engineering
ADAM LANDBERG
Department of Computer Science and Engineering
CHALMERS UNIVERSITY OF TECHNOLOGY
UNIVERSITY OF GOTHENBURG
Gothenburg, Sweden 2024

Master's Thesis 2024
© ADAM LANDBERG, 2024.
Supervisor: Paweł W. Woźniak, Department of Computer Science and Engineering
Advisor: Kristina Knaving, RISE Research Institutes of Sweden
Examiner: Marina Axelson-Fisk, Department of Mathematical Sciences
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000
Cover: An AI-Wolf in Sheep's Clothing
Gothenburg, Sweden 2024

Abstract

This report investigates whether it is possible for Artificial Intelligence (AI) chatbots and AI detectors to detect texts created by AI in Swedish and in English. Focusing mainly on texts created by the AI chatbot ChatGPT 4, the performance of the AI detectors Smodin and Copyleaks is investigated. The research is motivated by the scarcity of previous research on AI in Swedish and an articulated need from Swedish schools to understand whether it is possible to detect AI-created content in homework and essays. By prompting the AI detectors with 400 articles written by humans in Swedish, 400 articles written by humans in English, 400 texts created by AI in Swedish, and 400 texts created by AI in English, the AI detectors were thoroughly examined. Smodin showed an accuracy between 73.61% and 73.89% at a 99% confidence level on Swedish content, and an accuracy between 94.68% and 94.82% at a 99% confidence level on English content. Copyleaks, on the other hand, showed an accuracy between 94.80% and 94.95% at a 99% confidence level on Swedish content, and an accuracy of 100% at a 99% confidence level on English content. This indicates that, while always successfully detecting texts written by humans, it is possible to detect Swedish text created by AI nine times out of ten.

Keywords: Artificial Intelligence, AI, AI chatbots, AI detectors

Acknowledgements

Firstly, I would like to express my warmest thank you to my fantastic, free-thinking advisor Kristina Knaving for supporting me throughout the whole work. Her way of encouraging me and believing in me was exactly what I needed for this to work out. To be given a bun and a coffee, or a warming hug when it was needed, is what made the journey to finalise this work an easy ride. I am also thankful for being a part of your curious thinking, and I have learned much from your reflections. With that said: Thank you!

Secondly, I would like to express my greatest thank you to my fascinating supervisor Paweł W. Woźniak for being there as my help in the work of making this research come true. I remember the time we were discussing solutions on how to gather the data sets of the main study, and how you suggested to structure the work.
As it goes: Thank you! Finally, I would like to express my most humble thank you to my solution oriented examiner Marina Axelson-Fisk for helping me when things were unclear. I am espe- cially thankful for how you helped me with the calculation of the confidence intervals that can be found in section 3.2.2 Confidence intervals for the classifications. So to you: Thank you! I would also like to express my gratitude to the many more people that have reflected with me, that have held me up, that have pushed me to my limit, and, of course, celebrated with me! So to all of you: Thank you! Ingen nämnd - ingen glömd. Kärlek och respekt, Adam Landberg, Gothenburg, 2024-06-19 vii AI declaration The author confirms that AI was not used for the present report other than where so is stated. Adam Landberg, Gothenburg, 2024-06-19 ix Contents List of Figures xiii List of Tables xv 1 Introduction 1 1.1 The problem statement . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 What is generative AI? . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.3 How is generative AI used? . . . . . . . . . . . . . . . . . . . . . . . . 3 1.3.1 Problematic usage . . . . . . . . . . . . . . . . . . . . . . . . 3 1.3.2 Counteracting problematic usage . . . . . . . . . . . . . . . . 4 1.4 The Swedish setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 1.5 Purpose of the study . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2 Method 9 2.1 The pre-study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.1.1 The Swedish part of the pre-study . . . . . . . . . . . . . . . . 10 2.1.1.1 The performance of the AI chatbots in Swedish . . . 10 2.1.1.2 The performance of the AI detector in Swedsih . . . 10 2.1.1.3 The Swedish data of the pre-study . . . . . . . . . . 11 2.1.2 The English part of the pre-study . . . . . . . . . . . . . . . . 11 2.1.2.1 The performance of the AI chatbots in English . . . 12 2.1.2.2 The performance of the AI detector in English . . . . 12 2.1.2.3 The English data of the pre-study . . . . . . . . . . . 13 2.2 The main study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.2.1 The Swedish data of the main study . . . . . . . . . . . . . . 14 2.2.2 The English data of the main study . . . . . . . . . . . . . . . 15 3 Results 17 3.1 The pre-study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.1.1 The Swedish part of the pre-study . . . . . . . . . . . . . . . . 17 3.1.1.1 The performance of the AI chatbots in Swedish . . . 17 3.1.1.2 The performance of the AI detector in Swedish . . . 18 3.1.2 The English part of the pre-study . . . . . . . . . . . . . . . . 19 3.1.2.1 The performance of the AI chatbots in English . . . 20 3.1.2.2 The performance of the AI detector in English . . . . 21 3.2 The main study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 xi Contents 3.2.1 The classifications . . . . . . . . . . . . . . . . . . . . . . . . . 23 3.2.2 Confidence intervals for the classifications . . . . . . . . . . . 24 4 Discussion 25 4.1 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 4.1.1 The pre-study . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 4.1.2 The main study . . . . . . . . . . . . . . . . . . . . . . . . . . 27 5 Conclusion 29 5.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 5.2 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 5.3 Ethics . . . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . . . . 30 5.4 Further studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 Bibliography 33 A The Swedish pre-study I B The English pre-study XXI C The Swedish main study XLIII D The English main study DLXXI E The summarised result from the main study MCCVII xii List of Figures 3.1 GPTZero probability span when classifying Swedish texts in the pre- study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 3.2 Box plot of GPTZero probability span when classifying Swedish texts in the pre-study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 3.3 GPTZero probability span when classifying English texts in the pre-study 21 3.4 Box plot of GPTZero probability span when classifying English texts in the pre-study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 xiii List of Figures xiv List of Tables 3.1 Confusion matrix of classifications made by Copilot (Copilot (Mi- crosoft BING) when classifying Swedish texts in the pre-study . . . . . 17 3.2 Confusion matrix of classifications made by ChatGPT 3.5 when clas- sifying Swedish texts in the pre-study . . . . . . . . . . . . . . . . . . 18 3.3 Confusion matrix of classifications made by GPT-SW3 (gpt-swe-20b- instruct) when classifying Swedish texts in the pre-study . . . . . . . . 18 3.4 Confusion matrix of classifications made by GPTZero when classify- ing Swedish texts in the pre-study . . . . . . . . . . . . . . . . . . . . 18 3.5 Confusion matrix of classifications made by Copilot (Copilot (Mi- crosoft BING) when classifying English texts in the pre-study . . . . . 20 3.6 Confusion matrix of classifications made by ChatGPT 3.5 when clas- sifying English texts in the pre-study . . . . . . . . . . . . . . . . . . 20 3.7 Confusion matrix of classifications made by GPT-SW3 (gpt-swe-20b- instruct) when classifying English texts in the pre-study . . . . . . . . 20 3.8 Confusion matrix of classifications made by GPTZero when classify- ing English texts in the pre-study . . . . . . . . . . . . . . . . . . . . 21 3.9 Classification by Smodin on Swedish texts in the main study . . . . . 23 3.10 Classification by Smodin on English texts in the main study . . . . . . 23 3.11 Classification by Copyleaks on Swedish texts in the main study . . . . 23 3.12 Classification by Copyleaks on the English texts in the main study . . 23 3.13 Confidence intervals for the accuracies of the different classifications made in the main study clustered in groups of Swedish and English (as presented in the confusion matrices). . . . . . . . . . . . . . . . . 24 3.14 Confidence intervals for the accuracies of the different classifications made in the main study clustered in groups of human written articles and AI created texts. . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 xv List of Tables xvi 1 Introduction 1.1 The problem statement The 30th of November 2022 marks the date when the AI chatbot ChatGPT was introduced on the internet. This happening sparked a fire all over the world because of its ability to respond like a human. Ever since that day, more and more AI chatbots have popped up on the internet, some better performing than others. Like the corona pandemic, this widespread availability of AI chatbots has dramatically shifted the way humans live their everyday life. One can say, that as easy it was for humans to adapt to a remote way of working during the corona pandemic, as easy it has been to use and benefit from AI chatbots. 
As the availability of AI chatbots increase, the amount of texts created by AI chat- bots will increase too. As the performance of AI chatbots improves, it will get more and more difficult to distinguish texts created by AI from texts written by humans. And, as the usage of AI chatbots intensifies, the everyday human might start to think that texts created by AI actually are written by humans. The phenomenon of mistaking texts created by AI for texts written by humans, was studied by Hitsuwari et al. In 2023 they published a report named ”Does human-AI collaboration lead to more creative art? Aesthetic evaluation of human-made and AI-generated haiku poetry” in which they studied if it was possible for humans to distinguish between human made and AI made haiku (a Japanese poem of seventeen syllables). As to confirm the statement in the previous paragraph, Hitsuwari et al. could prove that it was impossible for humans to distinguish between haikus created by AI and haikus written by humans (Hitsuwari et al., 2023). The human inability to distinguish between longer texts created by AI and longer texts written by humans, had previously been studied by Partadiredja et al. (2020). They had found out that the ability of people to distinguishing between human made texts and AI-made text was 54%. In other words, the ability of humans to distinguish text created by AI from texts written by humans was proven not to be any better than the flipping of a coin. With the previously presented information at hand, one might ask if it is possible to develop an AI that could aid in the work of doing such a classification that humans fail to do. To rephrase, if humans fail to distinguish between texts created by AI and texts written by humans, could an AI do the job instead? 1 1. Introduction 1.2 What is generative AI? To start off, it is necessary to understand what is meant by an AI being able to do the job. In this case, what will be referred to as AI is what otherwise is called generative AI. It is important to understand that generative AI is not a digital copy of a brain, even though it might sometimes look like one. In the article ”How AI chatbots like ChatGPT or Bard work visual explainer” (The Guardian, 2023), the function of generative AI is explained as being inspired by how human brains work, with emphasis on inspired. Since it is fundamental to understand the function of generative AI to be able to follow the reasoning in this thesis, an explanation will follow below. The kind of generative AI that produce texts are usually referred to as Large Lan- guage Models (LLM ). These LLMs consists of connections and nodes that connect words, or parts of words, with each other. These connections between different nodes have different distances. Those distances are fundamental for the LLM because, as the distance between two nodes becomes smaller the probability of one of the nodes following the other one increases. (The Guardian, 2023) More technically speaking, each node (i.e. word or part of a word) is saved in the model as a token, and the distance between different nodes are saved as vectors. In this way, by using the tokens and the vectors, the nodes can be mapped in a multidimensional space. In this space, the distance between different nodes are what makes it possible for the LLM to predict the next node (i.e. word or part of a word) in a sequence. 
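To make the idea of tokens, vectors and distances concrete, consider the following minimal sketch. It is purely illustrative: the tiny vocabulary, the three-dimensional vectors and the context vector are invented for the example, and real models use far larger vocabularies, much higher-dimensional learned embeddings and attention over the whole context rather than a single distance comparison.

```python
import numpy as np

# Toy vocabulary with made-up 3-dimensional vectors (real models use tens of
# thousands of tokens and vectors with thousands of dimensions).
vocab = {
    "fork":  np.array([0.9, 0.1, 0.0]),
    "pesto": np.array([0.1, 0.9, 0.0]),
    "car":   np.array([0.0, 0.0, 1.0]),
}

def next_token_probabilities(context_vector):
    """Turn distances between the context and each token into probabilities:
    the closer a token's vector is to the context, the more likely it is."""
    distances = {tok: np.linalg.norm(vec - context_vector) for tok, vec in vocab.items()}
    # Smaller distance -> larger score; a softmax turns scores into probabilities.
    scores = np.array([-d for d in distances.values()])
    probs = np.exp(scores) / np.exp(scores).sum()
    return dict(zip(distances.keys(), probs))

# An invented context vector leaning towards "tidy ... eat pasta with".
context = np.array([0.8, 0.2, 0.0])
print(next_token_probabilities(context))   # "fork" gets the highest probability
```

The mechanisms in production LLMs are considerably more elaborate, but the core idea of turning positions in a vector space into probabilities for the next token is the one described in the visual explainer.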
(The Guardian, 2023) With what has been previously stated, it is appropriate to make it clear that an LLM uses mathematical processes to predict the next token in a string of text. However, the LLM does not only look at the last node in the relevant string to predict the next one. Instead, the LLM uses other nodes in the string to understand the context. To explain what that means, it is time to look at an example taken from The Guardian (2023): Let us consider the following situation where you are asking the LLM two questions: Question: What is a tidy thing to eat pasta with? Answer: A tidy thing to eat pasta with is a fork. and Question: What is a nice thing to eat pasta with? Answer: A nice thing to eat pasta with is pesto. This example has been taken from The Guardian (2023) Notice how the LLM model recognizes the differences in the sentences even though the words that make the difference were not the last words in any of the text strings! This is because the LLM takes all words it recognizes in the question text string, weights them with each other, and put certain emphasis on the words eat, pasta, tidy, and nice to make the answers appropriate. In the same way, the word with will be reweighed depending on the question asked. For the question containing the 2 1. Introduction word tidy the word with will most likely be associated with words like ”using” and ”by means of”, whilst for the question containing the word nice the word with will most likely be associated with words like ”accompanied by”. (The Guardian, 2023) With this said, the important takeaway is to understand that the LLM does not understand a text as a human do, instead it is a mathematical model that predicts the most likely next word in a sequence of words or sentences. 1.3 How is generative AI used? With the LLM explained, it is necessary to understand how generative AI is used today. One of the most well known generative AI is called ChatGPT, which is an AI chatbot that was released in November 2022. In January 2023, Reuters (2023) reported that ChatGPT had been used by 100 million users that certain month only. If that was not enough, in the same month, ChatGPT had had 13 million unique users per day (Reuters, 2023). This increased usage of AI chatbots had an effect in many parts of society, including academia. In the same reporting as was previously mentioned, Reuters made an indicating statement presenting ”questions about facilitation of academic dishonesty and mis- information” (Reuters, 2023). The reporting from Reuters was not the only one introducing question about AI and academia. On the 10th of January 2023, Forbes staff Bushard (2023) reported that ”[f]ake scientific abstracts and research papers generated using OpenAIs highly-advanced chatbox ChatGPT fooled scientists into thinking they were real reports nearly one-third of the time, [...]”. Half a year after Bushard’s reporting, Bisi et al. (2023) presented a study on how much AI created content 425 articles, published before and after the launch of ChatGPT, contained. The research ”[...] showed an increase in AI generation for complete articles [...] and for the body of articles [...]”. The authors also argue that AI content had been present already before the wide spread of ChatGPT, but their findings could how- ever confirm that the presence of AI content had significantly increased over time nonetheless. (Bisi et al., 2023) 1.3.1 Problematic usage As earlier presented , the usage of generative AI is high and widely spread in society. 
However, what has not yet been covered is the potential impact the availability of AI has had on the society. Therefore, it is time to look at an article named ”The World of Generative AI: Deepfakes and Large Language Models” by Mitra et al. (2024). In this article, the authors are covering the harmful ways in which LLMs can be used. More specifically, the authors are investigating the topic of deepfakes as they are made possible through the use of LLMs. The authors state that the mere existence of deepfakes is not problematic, however they claim that, as it gets attention by the global hacking community, the risk of AI-crimes increasing significantly seems as given, rather than only a possibility. In other words, it should be expected that AI-crimes like theft, fraud, forgery and personation (King et al., 2020) will increase as a result of the wide spread and wide usage of generative AI. 3 1. Introduction Another aspect of problematic usage of LLMs, is the spreading of fake news. This phenomenon was studied by Sun et al. (2024) who tested real news articles against fake news articles on humans and AI detecting algorithms. More specifically, they tested fake news, created by prompting different AI algorithms with statements ts create fake news, against real news articles written by humans. What the authors found was that the fake news articles were difficult to distinguish from the real news articles. Hence, the authors could claim that, for both humans and AI, it was difficult to distinguish between human written articles and AI created texts (Sun et al., 2024). What that means is that, when humans cannot distinguish fake news from real news, and when AI cannot help humans make that distinguishment, the risk of fake news spreading as facts increases. One more article looking into the topic of detecting fake information is ”Disinforma- tion Detection: An Evolving Challenge in the Age of LLMs” by Jiang et al. (2024). In that article, the researchers studied the question of how current state-of-the-art detection models can detect LLM-generated disinformation. Firstly, they concluded that the current state-of-the-art detection models could not detect LLM-generated disinformation. Therefore, they continued their research with asking if LLMs could be used instead. The authors could conclude that the technique using LLMs also faced challenges and was therefore neither seen as a sufficient solution (Jiang et al., 2024). A concrete example of a study trying to detect fake information in an everyday setting was presented by Ayoobi et al. (2023) who studied LinkedIn profiles gener- ated by LLM. One of the main reasons the authors studied this phenomenon was because these fake accounts rapidly could access private and sensitive information from human users on the platform. Also, as a fake account stays active, it gains more and more credibility leading to an increased risk of it engaging in phishing and other kinds of scamming. This phenomenon was also covered by Beyond Identity (2023) which reported that ”39% of people said they would fall victim to at least one phishing scam generated by ChatGPT.” In other words, fake information on social media platforms lead to an increased level of AI-crimes. (Ayoobi et al., 2023) 1.3.2 Counteracting problematic usage What seems to be one solution to mitigating the problem with the spreading of fake information, is the need to recognise AI created content. In the previously mentioned research made by Ayoobi et al. 
(2023), the authors presented a possibility to mitigate the problems with fake accounts created using LLM. What they could prove was that, by using a certain developed algorithm it was possible to detect profiles created by LLM at an accuracy of at least 90% in English. However, Liang et al. (2024) could show that ”[...] roughly 7-15% of sentences in ML conference reviews were substantially modified by AI beyond a simple grammar check [...]”. Adding to that, the authors also showed that the words commenadable, innovative, meticulous, intricate, notable, and versatile had increased significantly in the ICLR 2024 (International Conference on Learning Representations 2024) peer reviews compared to the years before, indicating an increased level of AI usage in 4 1. Introduction these articles. What this confirms, is that the use of LLM is wide spread and has, to this day, been used in scientific writings without being detected. Thus, whilst Ayoobi et al. (2023) confirm that it is possible to detect AI content on social media accounts, Liang et al. (2024) claims that doing so in academic writing seems rather difficult. As a consequence of these problems, so called AI detectors have popped up over the internet. Those AI detectors claim to be able to detect AI created texts. In a preliminary study, made by Habibzadeh (2023), one of these AI detectors named GPTZero was tested. In that study 30 English texts written by humans and 20 English texts created by AI was put into the GPTZero for classification. The result was a accuracy of 80%, a specificity of 90%, and a sensitivity of 65%. That could be perceived as a good performance since it seems to not be guessing the results, however the accuracy is too low for GPTZero to be used in a reliable way in, for example, a school setting (Habibzadeh, 2023). Another research looking into the performance of AI detectors was performed by Chaka (2023) in a study looking into AI generated texts from the different LLMs and how well five different AI detectors could detect them. What they could conclude, was a confirmation of the findings made by (Habibzadeh, 2023) and they stated: ”[...] it is evident from this paper that all five AI content detectors seem not yet fully ready to accurately and convincingly detect AI-generated content from machine- generated texts in different contexts. This has dire consequences for AI-generated plagiarism in academic essay writing.” Source: Chaka (2023) To understand the inability of AI detectors to accurately detect AI created content, it is necessary to understand the functionality of AI detectors. Ma et al. (2023) presented a reasonable exposition on how AI tools could work by focusing on the use of perplexity as a tool when developing and employing the relevant AI. What is perplexity? When using perplexity in relation to AI, and more specifically Natural Language Pro- cessing (NLP), it refers to the process of guessing the next word in a text string. The better the AI is at guessing the next word, the lower the perplexity score gets. When an AI language model is developed, it is usually designed to produce texts with low perplexity. On the other hand side, when humans produce texts these often show of higher perplexity. Thereby, perplexity can be used as a tool for distinguishing between texts created by AI (which should show of lower levels of perplexity) and texts written by humans (which should show of higher levels of perplexity). Source: Stephen M. Walker II (n.d.) 5 1. Introduction Ma et al. 
(2023) could claim that the AI detector makes the distinguishment better than what a human do, but they were unsure about the future performance. They claimed that the AI detector will become worse and worse at doing this job as the available AI language models develop and become more advanced. Put differently, as AI language models develops they will most probably imitate humans more and more, consequently leading to an increased level of perplexity also in texts created by AI. Thereby, Ma et al. (2023) instead claim that the future focus of detecting AI created content should focus on semantics (the meaning of words and sentences) and pragamatics (the meaning of phrases in different contexts) rather than perplexity. This finding confirms what was also found by Jiang et al. (2024) when they claimed that LLMs were not successful in the work of detecting AI content, and instead having to recognise contextual elements to do so. This need for shift in focus regarding how AI detectors should work, is insinuated by Elkhatat et al. (2023) in their article ”Evaluating the efficacy of AI content detection tools in differentiating between human and AI-generated text”. The authors agree with Ma et al. (2023) regarding that it is indeed possible for AI detectors to recognise AI created content, but they emphasise the need to be mindful about the capabilities of generative AI as the AI models advance. This claim was supported by their findings about how AI detectors performed inconsistently and with a worse performance on detecting ChatGPT 4 content compared to ChatGPT 3.5 content (Elkhatat et al., 2023), which also confirms the conclusion made by Ma et al. (2023). With this said, it comes clear that the work of classifying AI texts is a job that is in constant need of development. One of these kinds of developments was suggested by Cingillioglu (2023) who could show that, by using a Support Vector Machine (SVM), the work of identifying human written essays could reach an accuracy of 100%. However, this came at a cost of an overall accuracy of the model of 92.7% indicating a lower accuracy for detecting AI generated texts. The impaired ability to detect AI content was though perceived as acceptable because of a zero tolerance of any human written essays being classified as created by AI. Another perspective that sometimes is claimed as being useful in the work of de- tecting AI created content is detecting hidden patterns in texts, which usually is referred to as watermarking. One of these available watermarking techniques was presented by Kirchenbauer, Geiping, Wen, Katz, et al. (2023). Their technique can be described as focused on the number of occurrences of certain preferred words in a text string. When a text string is examined and the limit of preferred words has been exceeded, the text string should be perceived as generated by an AI. Their proposed solution is later strengthened by findings by Kirchenbauer, Geiping, Wen, Shu, et al. (2023) (note: three of the authors in this article is the same as in the previously mentioned article) and they claim: ”We advocate for a view of watermarking reliability as a function of text length, and find that even human writers cannot reliably remove watermarks if being measured at 1000 words, despite having the goal of removing the watermark.” Source: Kirchenbauer, Geiping, Wen, Shu, et al. (2023) 6 1. Introduction This technique was later criticised by Weichert and Dimobi (2024), who argue that the assumptions to the solution does not hold. 
They instead claim that texts writ- ten by humans are not as random as Kirchenbauer, Geiping, Wen, Shu, et al. (2023) assert, and instead mean that human writing does not have a uniformly distributed frequency of word usage. From that statement, Weichert and Dimobi (2024) could show that ”[...] the watermarking algorithm does not accurately model human lan- guage by assuming that tokens in human text are uniformly distributed” and they propose further studies on the reliability of the watermarking technique. What all this stated, it seems like there exist solutions on how to recognise AI content on the market. However, as the performances of these AI detectors are claimed to be on a certain level, there is nothing really proving so is actually the case. 1.4 The Swedish setting In previous sections, the increasing presence of AI in the everyday human’s life, and the lack of proper detection tools for detecting texts created by AI, has been presented. Worth noticing is that the previously presented studies have all been performed in English, meaning there is much less to conclude for languages other than English. As for Swedish, (which for the record is a much less wide spread language when compared to English) it is hard to find any significant studies made on AI detectors and their performance. What could be said about AI and Swedish is though that AI chatbots, like ChatGPT 3.5, is already capable of creating Swedish texts. Thus, an ability to detect texts created by AI in Swedish becomes important. As shown before, plenty of studies have claimed that AI detectors can detect English texts created by AI, but that the AI detectors are imperfect. This raises questions about how these AI detectors could be performing on Swedish content. There could be a situation where the differences in languages between Swedish and English would cause the AI detectors to perform better on Swedish than on English; worse on Swedish than on English; or the same on both Swedish and on English.∗ Detecting AI content is a growing interest. One of the reasons for this is because of the fast rate in which AI can produce content (both true and false) and, consequently, how fast this produced content can be spread. Another reason is the need to know whether something is written by a certain person, for example in school. On the 30th of August 2023 the Swedish public service television broadcaster Sveriges Television (SVT ) reported that: ”More than half of the young people in a recent survey from ”Ungdomsbarometern” answer that they have, at some point, used AI for a school assignment in a way that they themselves are not sure is allowed.” (free translation). Source: SVT Nyheter (2023b) Already in April 2023, SVT had reported on the matter of pupils using AI to cheat in school. That certain reporting was regarding the national exams that could ∗What is mentioned as ”to perform” is referring to the accuracy of the different AI detectors. 7 1. Introduction be prepared at home, and how AI could be used to cheat on those examinations. A certain teacher that was interviewed in the article claimed that all submissions should be completely banned (SVT Nyheter, 2023a). Consequently, the Swedish National Agency for Education, Skolverket, published information on their website about AI in school in the beginning of September 2023. In that information Skolverket stated that: ”There are currently no reliable tools to determine whether a text is written by a chatbot. 
The Swedish National Agency for Education advises against submitting as- signments as a basis for grades if you, as a teacher, cannot ensure the reliability of the content.” (free translation). Source: Skolverket (2023) As shown, it is difficult for both pupils and teachers to orientate around the topic of AI. There still seems to exist a belief that it is possible to ask an AI if a text was written by an AI or a human, and to believe to be given a trustworthy answer. This is today unclear, especially in a Swedish setting, and research is therefore needed in the area. 1.5 Purpose of the study With all this stated, it is interesting to study whether AI-models could be better performing than humans and thereby help humans make a more accurate prediction of whether a text in Swedish is written by a human or created by an AI. That is to say: Could AI do the job of distinguishing between texts created by AI and texts written by humans in Swedish? The framing of this study is as follows: Three AI chatbots and three AI detectors will be tested on texts created by AI and texts (more specifically: articles) written by humans, in both English and in Swedish. That is made to see how well the AI chatbots and the AI detectors perform on the different inputs. The study aims to find how well these different AI chatbots and AI detectors perform when asked to detect whether a content is made by an AI or by a human. The gaps in knowledge that this research seeks to fill is the lack of proper studies on how AI detection could be carried out in Swedish. There is a lot of studies on logic behind the function of AI detection in the English language, but there are no real studies on Swedish. More specifically, there is no studies on how well these self proclaimed AI detection tools perform in other languages than English. Adding to that the requests from Swedish schools to get help with how to work around the issue with pupils handing in homework and essays not written by themselves but rather by using an AI, strengthens the reasons for carrying out this kind of research. 8 2 Method This chapter is divided into two parts: the pre-study and the main study. Since a study like this has not been taking place in a Swedish setting before, an initial study (which will be referred to as the pre-study) was firstly undertaken. That study tested three different AI chatbots and one AI detector on their ability to detect AI created content. The data these AI chatbots and the AI detector were tested on, was a dataset of in total 16 articles written by humans in Swedish, 16 articles written by humans in English, 16 texts created by AI in Swedish, and 16 texts created by AI in English. The results from the pre-study was then used to design the succeeding main study. The main study was designed and undertaken to study two AI detectors in more depth. The data these AI detectors were tested on was a dataset of in total 400 articles written by humans in Swedish, 400 articles written by humans in English, 400 texts created by AI in Swedish, and 400 texts created by AI in English. The findings from the main study are then presented as the main findings of this research. More detailed information is presented below. 2.1 The pre-study The data referred to in the whole of section 2.1 The pre-study can be found in Appendix A (for the Swedish parts) and Appendix B (for the English parts). 
The pre-study was undertaken to make an initial attempt to map the environment of how AI classifies texts as either written by a human or created by an AI. To make this pre-study the following three different AI chatbots were chosen: • Copilot (Microsoft BING) • ChatGPT 3.5 • GPT-SW3 (gpt-swe-20b-instruct) The choice of using Copilot (Microsoft BING) and ChatGPT 3.5 was because of them being free to use and widely available. The reason for using GPT-SW3 (gpt- swe-20b-instruct) was because it is the first version of a GPT that was made specif- 9 2. Method ically to be used in Swedish and because it was offered for free. Adding to those three chatbots, a so called ”AI detector” was also chosen. The choice fell on the tool GPTZero for its wide spread in the area of this kind of classifications and because of it being free to use. 2.1.1 The Swedish part of the pre-study To try to make an initial attempt to answer the question if it is possible to distinguish between human written articles and AI created texts in Swedish, the Swedish data set was put in a random order as presented in Appendix A. The data points were then put in an orderly manner into the following three different AI chatbots: • Copilot (Microsoft BING) • ChatGPT 3.5 • GPT-SW3 (gpt-swe-20b-instruct) 2.1.1.1 The performance of the AI chatbots in Swedish When it comes to the AI chatbots, the prompt with which they all were prompted with, was: Är den här texten skriven av en AI eller en människa: [current data point from the Swedish data set] The Swedish prompt would mean this in English: Is this text written by an AI or a human: [current data point from the Swedish data set]. This was made for each data point in the Swedish data set in the order presented in Appendix A. All answers from each AI chatbot was recorded and marked as Human, AI or No classification based on the response from the AI chatbots. The classification was made like this: • If the AI chatbot responded something like ”This text appears to have been written by a human.” or in any other similar way, the classification was recorded as Human. • If the AI chatbot responded something like ”This text appears to have been written by an AI.” or in any other similar way, the classification was recorded as AI. • If the AI chatbot responded something like ”It is uncertain whether the text was written by a human or an AI.” or in any other similar way, the classification was recorded as No classification. 2.1.1.2 The performance of the AI detector in Swedsih When it comes to the AI detector, the text was loaded into the AI detector without any additional prompt and the AI detector responded with a probability of the text 10 2. Method being written by a human, an AI, or if it was a mix between them both. When sum- ming the three different probabilities they summed up to 100%. The classification was made like this: • If the AI detector responded with a probability greater than 50% of the text being written by a human, the classification was recorded as Human. • If the AI detector responded with a probability greater than 50% of the text being created by an AI, the classification was recorded as AI. • If the AI detector responded with a probability greater than 50% of the text being written by a human and an AI together, if the AI detector responded with a probability lower than 50% of the text being written by a human, or if the AI detector responded with a probability lower than 50% of the text being written by an AI, the classification was recorded as No classification. 
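Expressed as code, the rules above correspond to the following minimal sketch (the function name and the numeric input format are assumptions for illustration; in practice GPTZero was used through its web interface):

```python
def classify_pre_study(p_human: float, p_ai: float, p_mixed: float) -> str:
    """Map the detector's three probabilities (which sum to 100%) to a label,
    following the classification rules used in the pre-study."""
    if p_human > 50:
        return "Human"
    if p_ai > 50:
        return "AI"
    # Mixed above 50%, or neither Human nor AI above 50%.
    return "No classification"

print(classify_pre_study(p_human=20, p_ai=70, p_mixed=10))  # -> "AI"
```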
2.1.1.3 The Swedish data of the pre-study To test the performance of the three AI chatbots and the AI detector on Swedish content, 16 articles mentioned in the article ”Från chockgranater och laxkanoner till coronakris – här är nyhetsåret 2020” by the Swedish national public television broadcaster, SVT, were collated (SVT Nyheter, 2020). They were then ordered and given an index between H1 and H16 (H standing for the text being written by a human). Following, the headlines of every indexed article was extracted and turned into sixteen prompts that could be put into an AI chatbot. The prompts were written in Swedish like this: Skriv en kort nyhetsartikel med rubriken [headline of pooled article]. The Swedish prompt would mean this in English: Write a short news article with the headline [headline of pooled article]. The chosen AI chatbot to respond to the prompts was ChatGPT 3.5. The 16 prepared prompts were put into the AI and the responses from ChatGPT 3.5 were ordered and given an index between A1 and A16 (A standing for the text being created by an AI). With the human written texts and the AI created texts at hand, a data set of 32 data points were made consisting of 16 human written texts in Swedish (H1-H16) and 16 AI created texts in Swedish (A1-A16). 2.1.2 The English part of the pre-study To try to make an initial attempt to answer the question if it is possible to distinguish between human written articles and AI created texts in English, the English data set was put in a random order as presented in Appendix B. The data points were then put in an orderly manner into the following three different AI chatbots: • Copilot (Microsoft BING) 11 2. Method • ChatGPT 3.5 • GPT-SW3 (gpt-swe-20b-instruct) 2.1.2.1 The performance of the AI chatbots in English When it comes to the AI chatbots, the prompt with which they all were prompted with, was: Is this text written by an AI or a human: [current data point from the English data set] This was made for each data point in the English data set in the order presented in Appendix B. All answers from each AI chatbot was recorded and marked as Human, AI or No classification based on the response from the AI chatbots. The classification was made like this: • If the AI chatbot responded something like ”This text appears to have been written by a human.” or in any other similar way, the classification was recorded as Human. • If the AI chatbot responded something like ”This text appears to have been written by an AI.” or in any other similar way, the classification was recorded as AI. • If the AI chatbot responded something like ”It is uncertain whether the text was written by a human or an AI.” or in any other similar way, the classification was recorded as No classification. 2.1.2.2 The performance of the AI detector in English When it comes to the AI detector, the text was loaded into the AI detector without any additional prompt and the AI detector responded with a probability of the text being written by a human, an AI, or if it was a mix between them both. When sum- ming the three different probabilities they summed up to 100%. The classification was made like this: • If the AI detector responded with a probability greater than 50% of the text being written by a human, the classification was recorded as Human. • If the AI detector responded with a probability greater than 50% of the text being created by an AI, the classification was recorded as AI. 
• If the AI detector responded with a probability greater than 50% of the text being written by a human and an AI together, if the AI detector responded with a probability lower than 50% of the text being written by a human, or if the AI detector responded with a probability lower than 50% of the text being written by an AI, the classification was recorded as No classification. 12 2. Method 2.1.2.3 The English data of the pre-study To test the performance of the three AI chatbots and the AI detector on English content, four articles authored by the BBC (British Broadcasting Corporation), four articles authored by the RTÉ (Raidió Teilifís Éireann), four articles authored by the CNN (Cable News Network), and four articles authored by the ABC (American Broadcasting Company) were collated as 16 articles in English authored by humans. They were then ordered and given an index between h1 and h16 (h standing for the text being written by a human). Following, the headlines of every indexed article was extracted and turned into sixteen prompts that could be put into an AI chatbot. The prompts were written in English like this: Write a short news article with the headline [headline of pooled article]. The chosen AI chatbot to respond to the prompts was ChatGPT 3.5. The 16 pre- pared prompts were put into the AI and the responses from ChatGPT 3.5 were ordered and given an index between a1 and a16 (a standing for the text being created by an AI). With the human written texts and the AI created texts at hand, a data set of 32 data points were made consisting of 16 human written texts in English (h1-h16) and 16 AI created texts in English (a1-a16). 2.2 The main study The data referred to in the whole of section 2.2 The main study can be found in Appendix C (for the Swedish parts), Appendix D (for the English parts), and Appendix E (for the overall result). Based on the results from the pre-study, it came clear that there was no need to try to test the different AI chatbots to any greater extent, since they showed of bad performances. However, studying the performances of the self proclaimed AI detectors came as a more reasonable study to perform. In the main study the performance of AI detectors classifying texts as either written by a human or created by an AI were taken to a greater extent than in the pre-study. However, in the main study GPTZero was excluded and replaced by two other AI detectors, namely: • Smodin • Copyleaks The choice fell upon these two AI detectors mainly because of the reason that GPTZero had showed of really bad performance in Swedish in the pre-study. It was therefore a necessity to choose new AI detectors, that claimed to be able to detect AI content in both English and Swedish. As such, Smodin and Copyleaks were the 13 2. Method only two AI detectors that claimed to be able to make this kind of distinguishment and they were thereby chosen. The main study is an up-scaled version of the method presented by Habibzadeh (2023) in his article ”GPTZero performance in identifying artificial intelligence- generated medical texts: a preliminary study”. This meant that the Swedish and the English datasets of 800 data points each, were loaded into the AI detectors without any additional prompt. On the prompt, the AI detectors responded with a probability of the text being written by an AI. The classification was made like this: • If the AI detector responded with a probability smaller than 50% of the text being written by an AI, the classification was recorded as Human. 
• If the AI detector responded with a probability of 50% or greater of the text being created by an AI, the classification was recorded as AI.
• If the AI detector responded with a probability lower than 0% or higher than 100% of the text being written by an AI, the classification was recorded as Error.

When choosing articles to classify, articles published before 2021 were selected. If an article had been updated after the end of 2020, however, it was excluded.

2.2.1 The Swedish data of the main study

To test the performance of the two AI detectors on Swedish content, articles written by humans and texts created by AI were needed. Therefore, 400 articles were selected from the dataset News articles and front pages from 19 Swedish news sites during the covid-19/corona pandemic 2020–2021 (Dahlgren, n.d.). The selection was made as follows:

1. The references marked as both page_type: specific article and article_genre: news article were selected, and the other references were excluded.
2. Then all references that were marked as having a paywall were excluded.
3. Then all references published after 2020-12-31 were excluded.
4. Then all references updated after 2020-12-31 were excluded.
5. Then all references were given a random value between 0 and 1.
6. Then the references were sorted in ascending order.

The process then went through every reference in this order. This meant manually going through the references until 400 articles that met the needs of the study had been found. When going through the references, references to dead ends (usually because the article had been deleted) were excluded. Articles in the form of bullet lists were also excluded. In total, 877 references were checked before arriving at 400 articles, meaning that 477 references were passed over. The 400 articles were then collated into a data set of 400 data points of Swedish articles written by humans.

Added to the 400 articles written by humans were 400 texts created by AI. To get the 400 AI-created texts, the following procedure was followed:

1. Each article (data point) in the dataset of Swedish articles was put into the ChatGPT 4 AI chatbot with a prompt of this form: Sammanfatta följande artikel i en mening: [the pooled article]. (The Swedish prompt means this in English: Summarize the following article in one sentence: [the pooled article].)
2. The summarizations were collated into a dataset of, in total, 400 summarizations of the Swedish human-written articles.
3. Each one-sentence summarization (data point) in the dataset of summarizations of the Swedish human-written articles was put into the ChatGPT 4 AI chatbot with a prompt of this form: Skriv en artikel om: [the one sentence summarization of the pooled Swedish article]. (The Swedish prompt means this in English: Write an article about: [the one sentence summarization of the pooled Swedish article].)
4. Each text created by the AI was saved.

Once 400 texts created by AI had been obtained, they were collated into a data set of 400 data points of Swedish texts created by AI. The data set of 400 data points of Swedish articles written by humans was thus accompanied by the data set of 400 data points of Swedish texts created by AI, and together they constituted a merged data set of, in total, 800 data points of Swedish articles written by humans and texts created by AI.
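As a compact illustration of the selection steps above, the following sketch applies the same filtering and random ordering. The file name, column names and date format are hypothetical (the real metadata from Dahlgren, n.d. has its own schema), and the screening for dead links and bullet-list articles was done manually in the study; only the prompt templates are quoted from the procedure above.

```python
import numpy as np
import pandas as pd

# Hypothetical file and column names; illustrative only.
refs = pd.read_csv("swedish_news_references.csv", parse_dates=["published", "updated"])

cutoff = pd.Timestamp("2020-12-31")
refs = refs[
    (refs["page_type"] == "specific article")
    & (refs["article_genre"] == "news article")
    & (~refs["paywall"])
    & (refs["published"] <= cutoff)
    & (refs["updated"].isna() | (refs["updated"] <= cutoff))
]

# Steps 5-6: give every remaining reference a random value between 0 and 1 and
# sort ascending; this fixes the order in which references were then checked
# manually until 400 usable human-written articles had been found.
refs["rand"] = np.random.rand(len(refs))
refs = refs.sort_values("rand").reset_index(drop=True)

# The two ChatGPT 4 prompts used to create the matching AI texts (Swedish originals):
SUMMARISE_PROMPT = "Sammanfatta följande artikel i en mening: {article}"
EXPAND_PROMPT = "Skriv en artikel om: {summary}"
```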
2.2.2 The English data of the main study To test the performance of the two AI detectors on English content, articles written by humans and texts created by AI was needed. Therefore, 400 articles were selected from the dataset All the News 2.0 – 2.7 million news articles and essays from 27 American publications (Thompson, n.d.). The selection was made as following: 1. The references published in any other year than 2020 were excluded. 2. Then all references were given a random value between 0 and 1. 3. Then the references were sorted in ascending order. The process then started to go throguh every reference in an orderly manner. This meant manually going through the references until having found 400 articles that met the needs for the study. When going through the references, references to articles updated after 2020-12-31 were excluded. Also, references to dead ends (usually because the article had been deleted) and articles on the form of bullet lists were excluded. In total 852 references were checked before concluding in 400 articles, meaning that 452 references were passed. When having found 400 articles, they were collated into a data set of 400 data points of English articles written by humans. 15 2. Method Adding to the 400 articles written by humans were 400 pooled texts created by AI. To get 400 AI created texts, a procedure as following was followed: 1. Each article (data point) in the datset of English articles were put in the Chat- GPT 4 AI chatbot with a prompt on this form: Summarize the following article in one sentence: [the pooled article]. 2. The summarizations were collated into a dataset of in total 400 summarizations of the English human written articles. 3. Each one sentence summarization (data point) in the datset of summariza- tions of the English human written articles were put in the ChatGPT 4 AI chatbot with a prompt on this form: Write an article about: [the one sentence summarization of the pooled Swedish article]. 4. Each text created by the AI were saved. When having been given 400 texts created by AI, they were collated into a data set of 400 data points of English texts created by AI. Now, the data set of 400 data points of English articles written by humans was accompanied by the data set of 400 data points of English texts created by AI into a merged data set of, in total, 800 data points of English articles written by humans and texts created by AI. 16 3 Results This chapter is divided into two parts. First comes the part presenting the results from the pre-study, and then comes the part presenting the results from the main study. 3.1 The pre-study Below is the results from the so called pre-study presented. Starting with a section focusing on the Swedish data set, and ending with a section focusing on the English data set. 3.1.1 The Swedish part of the pre-study In this section, the result from the Swedish pre-study is presented. In the same way as Habibzadeh (2023) presented his findings, the findings from this study will be presented by the use of confusion matrices. The first part focuses on the performance of the different AI chatbots, and the second part focuses on the performance of the AI detector. 3.1.1.1 The performance of the AI chatbots in Swedish The 32 prompts, followed by the relevant Swedish data point, were put into each AI chatbot. 
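The results below are reported as confusion matrices annotated with accuracy, sensitivity and specificity. As a reference for how those figures relate to the matrix cells, here is a minimal sketch assuming the standard definitions, with AI-created text treated as the positive class and the "Neu" column counted as incorrect for both classes; the helper function is illustrative and was not part of the study.

```python
def matrix_metrics(ai_row, human_row):
    """ai_row and human_row hold (predicted AI, predicted Neu, predicted Human)
    counts for texts that actually were AI-created and human-written, respectively."""
    tp = ai_row[0]        # AI texts correctly flagged as AI
    tn = human_row[2]     # human texts correctly flagged as human
    total = sum(ai_row) + sum(human_row)
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / sum(ai_row),      # share of AI texts detected
        "specificity": tn / sum(human_row),   # share of human texts recognised
    }

# The cells of Table 3.1 (Copilot on Swedish texts) reproduce the reported values:
print(matrix_metrics(ai_row=(11, 0, 5), human_row=(6, 2, 8)))
# {'accuracy': 0.59375, 'sensitivity': 0.6875, 'specificity': 0.5}
```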
The performance of the different AI chatbots is as follows:

Table 3.1: Confusion matrix of classifications made by Copilot (Microsoft BING) when classifying Swedish texts in the pre-study

                       PREDICTION
ACTUAL            Pos (AI)   Neu   Neg (Human)
Pos (AI)             11        0        5
Neg (Human)           6        2        8
Accuracy: 59%   Sensitivity: 69%   Specificity: 50%

Table 3.2: Confusion matrix of classifications made by ChatGPT 3.5 when classifying Swedish texts in the pre-study

                       PREDICTION
ACTUAL            Pos (AI)   Neu   Neg (Human)
Pos (AI)              0       13        3
Neg (Human)           0        9        7
Accuracy: 22%   Sensitivity: 0%   Specificity: 44%

Table 3.3: Confusion matrix of classifications made by GPT-SW3 (gpt-swe-20b-instruct) when classifying Swedish texts in the pre-study

                       PREDICTION
ACTUAL            Pos (AI)   Neu   Neg (Human)
Pos (AI)              3        2       11
Neg (Human)           1        0       15
Accuracy: 56%   Sensitivity: 19%   Specificity: 94%

3.1.1.2 The performance of the AI detector in Swedish

Each Swedish data point was then put into the AI detector, resulting in a total of 32 inputs. The performance of the AI detector is as follows:

Table 3.4: Confusion matrix of classifications made by GPTZero when classifying Swedish texts in the pre-study

                       PREDICTION
ACTUAL            Pos (AI)   Neu   Neg (Human)
Pos (AI)              0        0       16
Neg (Human)           0        0       16
Accuracy: 50%   Sensitivity: 0%   Specificity: 100%

For each prediction, GPTZero presents a probability span for how much it believes a text to be entirely written by a human, entirely created by an AI, or a mix of the two. The result for the 32 data points in the data set is presented in figure 3.1.

Figure 3.1: GPTZero probability span when classifying Swedish texts in the pre-study. [Chart of the Human, Mixed and AI probabilities (0-100%) for each data point A1-A16 and H1-H16.]

As can be seen in figure 3.1, GPTZero reports a constant 0% probability of every text being created by AI. Figure 3.2 below presents a box plot showing the variance of the probabilities calculated by the GPTZero algorithm.

Figure 3.2: Box plot of GPTZero probability span when classifying Swedish texts in the pre-study. [Box plots of the Mixed and Human probabilities for H1-H16 and A1-A16.]

3.1.2 The English part of the pre-study

In this section, the result from the English pre-study is presented. In the same way as Habibzadeh (2023) presented his findings, the findings from this study are presented using confusion matrices. The first part focuses on the performance of the different AI chatbots, and the second part focuses on the performance of the AI detector.

3.1.2.1 The performance of the AI chatbots in English

The 32 prompts, followed by the relevant English data point, were put into each AI chatbot.
The performance of the different AI chatbots is as follows:

Table 3.5: Confusion matrix of classifications made by Copilot (Microsoft BING) when classifying English texts in the pre-study

                       PREDICTION
ACTUAL            Pos (AI)   Neu   Neg (Human)
Pos (AI)              0        0       16
Neg (Human)           0        0       16
Accuracy: 50%   Sensitivity: 0%   Specificity: 100%

Table 3.6: Confusion matrix of classifications made by ChatGPT 3.5 when classifying English texts in the pre-study

                       PREDICTION
ACTUAL            Pos (AI)   Neu   Neg (Human)
Pos (AI)              0       13        3
Neg (Human)           0        3       13
Accuracy: 41%   Sensitivity: 0%   Specificity: 81%

Table 3.7: Confusion matrix of classifications made by GPT-SW3 (gpt-swe-20b-instruct) when classifying English texts in the pre-study

                       PREDICTION
ACTUAL            Pos (AI)   Neu   Neg (Human)
Pos (AI)              9        1        6
Neg (Human)           5        0       11
Accuracy: 63%   Sensitivity: 56%   Specificity: 69%

3.1.2.2 The performance of the AI detector in English

Each English data point was then put into the AI detector, resulting in a total of 32 inputs. The performance of the AI detector is as follows:

Table 3.8: Confusion matrix of classifications made by GPTZero when classifying English texts in the pre-study

                       PREDICTION
ACTUAL            Pos (AI)   Neu   Neg (Human)
Pos (AI)             16        0        0
Neg (Human)           0        0       16
Accuracy: 100%   Sensitivity: 100%   Specificity: 100%

For each prediction, GPTZero presents a probability span for how much it believes a text to be entirely written by a human, entirely created by an AI, or a mix of the two. The result for the 32 data points in the data set is presented below.

Figure 3.3: GPTZero probability span when classifying English texts in the pre-study. [Chart of the Human, Mixed and AI probabilities (0-100%) for each data point a1-a16 and h1-h16.]

As can be seen in figure 3.3, GPTZero shows good confidence in each correct classification. Figure 3.4 below presents a box plot showing the variance of the probabilities calculated by the GPTZero algorithm.

Figure 3.4: Box plot of GPTZero probability span when classifying English texts in the pre-study. [Box plots of the AI, Mixed and Human probabilities for h1-h16 and a1-a16.]

3.2 The main study

Below, the results from the main study are presented. In the same way as Habibzadeh (2023) presented his findings, the results are presented using confusion matrices.

3.2.1 The classifications

The results are presented in four different confusion matrices as follows:

Table 3.9: Classification by Smodin on Swedish texts in the main study

                       PREDICTION
ACTUAL            Pos (AI)   Neu   Neg (Human)
Pos (AI)            193       12      195
Neg (Human)           2        1      397
Accuracy: 74%   Sensitivity: 48%   Specificity: 99%
13 data points were classified as "No classification".

Table 3.10: Classification by Smodin on English texts in the main study

                       PREDICTION
ACTUAL            Pos (AI)   Neu   Neg (Human)
Pos (AI)            373       18        9
Neg (Human)          14        1      385
Accuracy: 95%   Sensitivity: 93%   Specificity: 96%
19 data points were classified as "No classification".

Table 3.11: Classification by Copyleaks on Swedish texts in the main study

                       PREDICTION
ACTUAL            Pos (AI)   Neu   Neg (Human)
Pos (AI)            359        0       41
Neg (Human)           0        0      400
Accuracy: 95%   Sensitivity: 90%   Specificity: 100%
0 data points were classified as "No classification".
Table 3.12: Classification by Copyleaks on the English texts in the main study
PREDICTION:            Pos (AI)   Neu   Neg (Human)
ACTUAL Pos (AI)          400       0        0
ACTUAL Neg (Human)         0       0      400
Accuracy: 100%   Sensitivity: 100%   Specificity: 100%
0 data points were classified as ”No classification”.

3.2.2 Confidence intervals for the classifications

To make sure that the presented data is reliable, confidence intervals for the different accuracies have been calculated. The confidence intervals have been calculated as a proportion of an infinite population, where the stochastic variable follows a binomial distribution. Since each sample consists of 800 data points, the value of p is estimated as p̂ = X/n, where X represents the correctly classified data points in each cluster and n is the total number of data points in each cluster, hence 800. Each confidence interval has then been calculated with the following formula:

$$\hat{p} \pm \frac{Z_\alpha}{n}\sqrt{\hat{p}\,(1-\hat{p})}$$

For the confidence intervals at the 95% level, $Z_\alpha$ was set to 1.96, and for the confidence intervals at the 99% level, $Z_\alpha$ was set to 2.58.

The maximum span of the confidence intervals in Table 3.13 at the 95% confidence level is 0.22 percentage points, and the maximum span at the 99% confidence level is 0.28 percentage points.

Table 3.13: Confidence intervals for the accuracies of the different classifications made in the main study, clustered in groups of Swedish and English (as presented in the confusion matrices).
Classification          Accuracy    95% Lower   95% Upper   99% Lower   99% Upper
Smodin on Swedish       73.75%      73.64%      73.86%      73.61%      73.89%
Smodin on English       94.75%      94.70%      94.80%      94.68%      94.82%
Copyleaks on Swedish    94.88%      94.82%      94.93%      94.80%      94.95%
Copyleaks on English    100.00%     100.00%     100.00%     100.00%     100.00%

The maximum span of the confidence intervals in Table 3.14 at the 95% confidence level is 0.22 percentage points, and the maximum span at the 99% confidence level is 0.30 percentage points.

Table 3.14: Confidence intervals for the accuracies of the different classifications made in the main study, clustered in groups of human-written articles and AI-created texts.
Classification          Accuracy    95% Lower   95% Upper   99% Lower   99% Upper
Smodin on Human         97.75%      97.71%      97.79%      97.70%      97.80%
Smodin on AI            70.75%      70.64%      70.86%      70.60%      70.90%
Copyleaks on Human      100.00%     100.00%     100.00%     100.00%     100.00%
Copyleaks on AI         94.88%      94.82%      94.93%      94.80%      94.95%
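As a small worked example, the interval for Smodin on Swedish in Table 3.13 can be reproduced directly from the counts in Table 3.9, where 193 + 397 = 590 of the 800 data points were correctly classified. The Python sketch below is purely illustrative (the function name is invented here and is not part of the study's tooling) and applies the formula exactly as stated above.

from math import sqrt

def confidence_interval(correct, n, z):
    # p_hat +/- (z / n) * sqrt(p_hat * (1 - p_hat)), as in section 3.2.2
    p_hat = correct / n
    half_width = (z / n) * sqrt(p_hat * (1 - p_hat))
    return p_hat - half_width, p_hat + half_width

# Smodin on Swedish: 590 of 800 data points correctly classified (Table 3.9).
low95, high95 = confidence_interval(590, 800, 1.96)
low99, high99 = confidence_interval(590, 800, 2.58)
print(f"95%: {low95:.2%} - {high95:.2%}")  # 73.64% - 73.86%
print(f"99%: {low99:.2%} - {high99:.2%}")  # 73.61% - 73.89%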
4 Discussion

4.1 Discussion

All studies undertaken in this research, that is, the studies performed on the three AI chatbots and the three AI detectors, have been performed by one person only: the author of this report. In this way, it has been ensured that there were no differences in how the six different studies were carried out. Subtle differences that can go undetected when two different persons attempt to do the same work have thereby been avoided, since only one person carried out the research.

Regarding the main study, the inputs into the two different AI detectors were made within a relatively short time span (from 2 May 2024 to 4 May 2024), mitigating the risk of the results being affected by updates to the AI detectors.

This study was conducted in an attempt to understand whether AI could do the job of distinguishing between texts created by AI and texts written by humans, with a focus on Swedish. The results from the pre-study were not conclusive, mainly because of differences in how different AI systems work. However, it is important to consider the results from the main study and to remember that they show significant findings.

4.1.1 The pre-study

The findings from the pre-study on Swedish show that none of the three studied AI chatbots, Copilot (Microsoft BING), ChatGPT 3.5, and GPT-SW3 (gpt-swe-20b-instruct), presented reliable classifications. The same was the case for the pre-study on English. This confirms what was presented by The Guardian (2023): an LLM predicts the most likely next word in a sequence of words or sentences and does not automatically understand that it should analyse the presented text.

On Swedish, Copilot (Microsoft BING) showed too low values of accuracy, sensitivity, and specificity (all around 60-65%). ChatGPT 3.5 had higher values of accuracy and specificity, but a sensitivity of 0%, since it never classified any data point as created by an AI. Lastly, GPT-SW3 (gpt-swe-20b-instruct) also showed too low values of accuracy, sensitivity, and specificity, however with a greater spread than Copilot (Microsoft BING) (spanning from 58% on specificity up to 75% on accuracy). Also worth noticing for GPT-SW3 (gpt-swe-20b-instruct) is that it produced a greater number of wrongfully classified human-written articles (11 classified as created by AI) than the total number of correctly classified data points (3 correctly classified AI-created texts and 5 correctly classified human-written articles, a total of 8).

On English, Copilot (Microsoft BING) had mediocre values of accuracy and specificity, and lacked a value for sensitivity since it never clearly classified any data point as created by an AI. ChatGPT 3.5 had higher values of accuracy and specificity, but also lacked a value for sensitivity. This is because it never clearly classified any data point as created by an AI; instead it marked those as neutral inputs. Lastly, GPT-SW3 (gpt-swe-20b-instruct) also showed too low values of accuracy, sensitivity, and specificity.

What this shows is that the pre-study did not give much hope for the potential performance of the three different AI chatbots. Regardless of whether Swedish or English was studied, the accuracies of the AI chatbots were too low, stemming from poor sensitivities as well as poor specificities. This confirms what was presented in the Introduction chapter, that ”[...] the LLM model does not understand a text as a human does, instead it is a mathematical model that predicts the most likely next word in a sequence of words or sentences.” This means that, regardless of whether the AI chatbots have the ability to analyse and classify a text, they do not possess the ability to ”understand” that they should take the part of the prompt containing the text to be analysed and classify it. The way in which the AI chatbots are structured thus hinders them from performing the task of classifying a text input. Therefore, dedicated AI detectors should be used instead.

Looking into the performance of the AI detector in the pre-study, different findings were presented for Swedish and for English. What can be concluded from looking at the confusion matrix is that GPTZero marks every Swedish input as an article written by a human. This is most probably because GPTZero is trained on English and only recognizes English words. In other words, what was presented by Ma et al.
(2023) could now be used to state that every word in Swedish would show high perplexity, since every word in every input is a word that GPTZero has not seen before, leading to every input being classified as written by a human. However, when examining the confusion matrix for the English inputs, GPTZero correctly classifies every input, presenting a better result than what Habibzadeh (2023) presented. This shows that the AI is capable of performing the work of distinguishing between human-made and AI-created texts in English, but it fails to do so on Swedish.

The difference in performance of the AI detector on Swedish and English was an interesting development. This led to the reasoning that, for an AI detector to be able to detect AI content in a certain language, the AI detector must also be competent in that language. When looking into the AI detector used in the pre-study, it therefore seemed like it could not handle Swedish. Worth noticing too is that, since the AI detector could correctly classify all English texts whilst classifying all Swedish texts as written by humans, it seemed like the AI detector was only looking for AI content. In other words, when the AI detector encountered something it did not recognise as being created by AI, it classified it as written by a human. This should also explain why the AI detector could correctly classify the texts in English whilst failing on Swedish. Since the AI detector could not recognize Swedish texts created by an AI, it simply marked them as written by humans.

With all this presented, it became clear that there was no need to study the AI chatbots further in this study, because of their inability to detect AI content. On the other hand, it was obvious that, to be able to study AI detectors properly, the AI detectors must have the ability to detect the Swedish language. Therefore, the main study was performed on two AI detectors that claimed to be able to detect both Swedish and English.

4.1.2 The main study

The main study focused on articles written by humans and texts created by AI; more specifically, 400 articles written by humans in Swedish, 400 articles written by humans in English, 400 texts created by AI in Swedish, and 400 texts created by AI in English.

To start with, the performance of the Smodin AI detector was better than that of GPTZero on Swedish. However, the performance of the Smodin AI detector was worse than that of GPTZero on English. This difference in performance on the two languages can be perceived as the result of two different phenomena. The first concerns the ability of Smodin to detect Swedish, which was the main problem for GPTZero. The better performance of the Smodin AI detector should mainly be considered a result of the actual ability of Smodin to detect Swedish content, something that GPTZero was incapable of. The second concerns the increased number of data points in the main study (800) compared to the pre-study (32), and how that affects the performance of Smodin on English. The slightly lower performance of Smodin, a few percentage points below GPTZero, can reasonably be regarded as a result of a bigger and more diverse data set. Also worth mentioning is that, as presented by Elkhatat et al.
(2023), development in AI models could make them harder to detect, and since the pre-study was carried out using ChatGPT 3.5 while the main study used ChatGPT 4, this slightly lower performance of Smodin might be due to the increased complexity of ChatGPT 4.

When looking into the performance of the Smodin AI detector on Swedish, it is clear that the accuracy of 74% is mainly a consequence of wrongful predictions on the actual AI-created texts. In other words, the low accuracy is mainly derived from a low sensitivity, since about half of the AI-created texts (195 of 400) were classified as human-written articles (false negatives).

Taking a look at the Copyleaks AI detector on Swedish, it shows a specificity of 100%, as it never wrongfully classifies any of the articles written by humans. However, every tenth text created by AI was classified as written by a human, which is reflected in a sensitivity of 90%, resulting in an overall accuracy of 95%.

Moving on to the English part of the main study, the performance of the Smodin AI detector is good, with high values of accuracy, sensitivity, and specificity. However, the Smodin AI detector is beaten by the Copyleaks AI detector on English, since the latter shows a constant 100% performance on all three of accuracy, sensitivity, and specificity.

As was presented in section 3.2.2 Confidence intervals for the classifications, the span of the confidence intervals at the 99% level was at most 0.30 percentage points. This indicates that the accuracies of the different AI detectors should be regarded as very precise.

Moving forward, the Smodin AI detector shows good values on Swedish, and even better values on English. Nonetheless, even though it performs better than GPTZero, it produces a higher number of wrongfully classified texts written by humans. This means that, when using Smodin as detector, some texts written by humans will actually be classified as created by AI. This is not a desirable state, since in a school setting it could be devastating to wrongfully accuse a pupil or a student of having used AI to create a text that they actually wrote themselves. One has to be extra cautious when testing pupils and students for academic dishonesty (Reuters, 2023), and even more cautious when it comes to higher levels of academic writing (Habibzadeh, 2023). In other words, for the purpose of this study, the aim is to find a solution that produces no false positive classifications. Therefore, Smodin is not perceived as a reliable option.

When looking at the Copyleaks AI detector, that AI detector performs better. Not only does it produce no false positive classifications on either English or Swedish, it also shows the best values of accuracy, sensitivity, and specificity. On English, there is not much to say: the performance of Copyleaks is 100% accurate. When looking at the performance of Copyleaks on Swedish, it is important to notice that every tenth text created by AI is wrongfully classified as written by a human. Still, since there are no wrongfully classified articles written by humans, the fact that nine out of ten texts created by an AI are correctly classified as created by an AI is a good performance. This shows that it is possible today for AI to distinguish between texts created by AI and articles written by humans.
Worth mentioning is also the fact that Copyleaks performs with a 100% accuracy (at the 99% confidence level) on all articles written by humans. This means that, in this research, not a single article written by a human was wrongfully classified as being created by an AI. In other words, the Copyleaks AI detector does not wrongfully classify any article written by a human, regardless of whether it is written in English or in Swedish.

5 Conclusion

5.1 Conclusion

With all this stated, it is safe to say that AI can do the job of distinguishing between texts created by AI and texts written by humans. However, it must be emphasised that it is not possible to ask an AI chatbot whether a text is created by an AI or written by a human and expect a reliable answer. What is possible, on the other hand, is for AI detectors to distinguish between texts created by AI and texts written by humans. On English, the best performing AI detector showed an accuracy of 100%, and on Swedish, the best performing AI detector showed an accuracy of just under 95%. The performance on English is clearly perfect, and whilst the performance on Swedish is lower, it is still good.

What can be said is that the specificity for the best performing AI detector on Swedish was 100%, meaning that no texts written by humans were wrongfully classified as created by AI. However, the sensitivity of 90% for the same AI detector confirms that every tenth text created by an AI will wrongfully be classified as written by a human. Thus, one could claim that AI can do the job of distinguishing between texts created by AI and texts written by humans in settings where it is important that texts written by humans are not wrongfully classified, but where some wrongful classification of texts created by AI is accepted. This could be a setting like the Swedish education system, where the results from an AI detection tool could be used for disciplinary actions.

The conclusion is thereby that, if you want an AI detector to detect Swedish texts created by AI, accept that every tenth of those will be wrongfully classified as written by a human, and also do not want texts written by humans to be wrongfully classified, then today an AI can do the job for you. With that said, since AI is constantly evolving and becoming more refined, one should still be careful about what results to expect for texts created by AI at a later date than today.

5.2 Limitations

This study has only been made with texts created by one AI, namely ChatGPT (version 3.5 in the pre-study and version 4 in the main study). This study can therefore not be used to say anything about the performance of detecting texts created by any other AI. Also, as this study is presented, OpenAI has just released GPT-4o, which OpenAI (2024) themselves describe as ”[...] our new flagship model that can reason across audio, vision, and text in real time.” There is nothing in this study that can confirm whether texts created by GPT-4o could be detected by an AI detector; therefore one should be cautious when wanting to detect texts created by AI in the near future.

It is also worth mentioning that the AI detectors studied in the main study might well improve, as well as decline, in performance as the development of generative AI proceeds. With that said, it is important to remember that the performance of the AI detectors studied in this research might be totally different tomorrow.

Another thing to be cautious about is that the study has been focused on articles.
The texts written by humans in this study were articles of different kinds, and the texts created by AI (ChatGPT 4) were explicitly prompted to look like articles. It is therefore difficult to draw any firm conclusions about what these findings would imply in a school setting, since the texts written in schools differ from what a newspaper article looks like.

Finally, it is important to mention that this study has only considered texts purely written by humans and texts purely created by AI. This means that the conclusions regarding the findings in this study are only applicable to texts of these kinds. If a text were created by an AI and thereafter manipulated by a human, or, say, if a text were written 50/50 by a human and an AI, then this study cannot be used to conclude anything about the ability of an AI detector to correctly classify those kinds of texts.

5.3 Ethics

In the setting of classifying something created by a human, wrongfully classifying a text written by a human as a text created by an AI could have negative consequences for the individual. If the classification is to be the foundation for a disciplinary action, such as an academic suspension, one must be sure that the basis for the decision is correct. Hence, one should be cautious and careful about the use of these kinds of AI detectors when using them for, for example, disciplinary reasons.

From another perspective, it is also important to remember that what has been said about Smodin and Copyleaks in this research is only applicable to those AI detectors. That means that, when using those AI detectors after they have been updated, or when using another AI detector, one should not be too confident in relying on their classifications.

It is also important to remember that some people, through their personal way of writing, will write in a more AI-like manner than others. This means that, if a human writes texts similarly to how an AI would have written them, this alone could be the reason why a text written by that person is classified as created by an AI instead of written by a human.

5.4 Further studies

As was presented in the previous section, this study has considered texts purely written by humans and texts purely created by AI. Therefore, it would be interesting to see further studies on texts partially created by AI and partially written by humans. Also, it would be interesting to see extensive research on the ability of AI detectors to detect content created by AI in the different Nordic languages. It would be interesting to understand whether there are subtle similarities or differences between the languages that make an AI detector trained on, for example, Swedish also able to detect AI content in, for example, Icelandic.

”It's murder on the dancefloor.” hummed Oliver Quick as he glanced over to Mr. Ripley.

Bibliography

Ayoobi, N., Shahriar, S., & Mukherjee, A. (2023). The looming threat of fake and LLM-generated LinkedIn profiles: Challenges and opportunities for detection and prevention. Proceedings of the 34th ACM Conference on Hypertext and Social Media. https://doi.org/10.1145/3603163.3609064

Beyond Identity. (2023). The future of hacking is AI. Retrieved September 24, 2024, from https://www.beyondidentity.com/reports-guides/the-future-of-hacking-is-ai

Bisi, T., Risser, A., Clavert, P., Migaud, H., & Dartus, J. (2023).
What is the rate of text generated by artificial intelligence over a year of publication in Orthopaedics & Traumatology: Surgery & Research? Analysis of 425 articles before versus after the launch of ChatGPT in November 2022. Orthopaedics & Traumatology: Surgery & Research, 109(8), 103694. https://doi.org/10.1016/j.otsr.2023.103694

Bushard, B. (2023). Fake scientific abstracts written by ChatGPT fooled scientists, study finds. Retrieved September 24, 2024, from https://www.forbes.com/sites/brianbushard/2023/01/10/fake-scientific-abstracts-written-by-chatgpt-fooled-scientists-study-finds/

Chaka, C. (2023). Detecting AI content in responses generated by ChatGPT, YouChat, and Chatsonic: The case of five AI content detection tools. Journal of Applied Learning and Teaching, 6(2).

Cingillioglu, I. (2023). Detecting AI-generated essays: The ChatGPT challenge. The International Journal of Information and Learning Technology, 40(3), 259-268.

Dahlgren, P. M. (n.d.). News articles and front pages from 19 Swedish news sites during the COVID-19/corona pandemic 2020-2021. https://doi.org/10.5878/d18f-q220

Elkhatat, A. M., Elsaid, K., & Almeer, S. (2023). Evaluating the efficacy of AI content detection tools in differentiating between human and AI-generated text. International Journal for Educational Integrity, 19(1), 17.

Habibzadeh, F. (2023). GPTZero performance in identifying artificial intelligence-generated medical texts: A preliminary study. Journal of Korean Medical Science, 38(38).

Hitsuwari, J., Ueda, Y., Yun, W., & Nomura, M. (2023). Does human-AI collaboration lead to more creative art? Aesthetic evaluation of human-made and AI-generated haiku poetry. Computers in Human Behavior, 139, 107502.

Jiang, B., Tan, Z., Nirmal, A., & Liu, H. (2024). Disinformation detection: An evolving challenge in the age of LLMs. In Proceedings of the 2024 SIAM International Conference on Data Mining (SDM) (pp. 427-435). https://doi.org/10.1137/1.9781611978032.50

King, T. C., Aggarwal, N., Taddeo, M., & Floridi, L. (2020). Artificial intelligence crime: An interdisciplinary analysis of foreseeable threats and solutions. Science and Engineering Ethics, 26, 89-120.

Kirchenbauer, J., Geiping, J., Wen, Y., Katz, J., Miers, I., & Goldstein, T. (2023). A watermark for large language models. International Conference on Machine Learning, 17061-17084.

Kirchenbauer, J., Geiping, J., Wen, Y., Shu, M., Saifullah, K., Kong, K., Fernando, K., Saha, A., Goldblum, M., & Goldstein, T. (2023). On the reliability of watermarks for large language models. arXiv preprint arXiv:2306.04634.

Liang, W., Izzo, Z., Zhang, Y., Lepp, H., Cao, H., Zhao, X., Chen, L., Ye, H., Liu, S., Huang, Z., et al. (2024).
Monitoring AI-modified content at scale: A case study on the impact of ChatGPT on AI conference peer reviews. arXiv preprint arXiv:2403.07183.

Ma, Y., Liu, J., Yi, F., Cheng, Q., Huang, Y., Lu, W., & Liu, X. (2023). AI vs. human: Differentiation analysis of scientific content generation. arXiv, 2301.

Mitra, A., Mohanty, S. P., & Kougianos, E. (2024). The world of generative AI: Deepfakes and large language models. arXiv preprint arXiv:2402.04373.

OpenAI. (2024). Hello GPT-4o. Retrieved May 20, 2024, from https://openai.com/index/hello-gpt-4o/

Partadiredja, R. A., Serrano, C. E., & Ljubenkov, D. (2020). AI or human: The socio-ethical implications of AI-generated media content. 2020 13th CMI Conference on Cybersecurity and Privacy (CMI) - Digital Transformation - Potentials and Challenges (51275), 1-6.

Reuters. (2023). ChatGPT sets record for fastest-growing user base - analyst note. Retrieved May 16, 2024, from https://www.reuters.com/technology/chatgpt-sets-record-fastest-growing-user-base-analyst-note-2023-02-01/

Skolverket. (2023). AI i skolan - risker och möjligheter. Retrieved May 15, 2024, from https://www.skolverket.se/om-oss/aktuellt/nyheter/nyheter/2023-09-05-ai-i-skolan---risker-och-mojligheter

Stephen M. Walker II. (n.d.). Perplexity in AI and NLP. Retrieved May 15, 2024, from https://klu.ai/glossary/perplexity

Sun, Y., He, J., Cui, L., Lei, S., & Lu, C.-T. (2024). Exploring the deceptive power of LLM-generated fake news: A study of real-world detection challenges. arXiv preprint arXiv:2403.18249.

SVT Nyheter. (2020). Från chockgranater och laxkanoner till coronakris - här är nyhetsåret 2020. Retrieved February 22, 2024, from https://www.svt.se/nyheter/lokalt/varmland/fran-chockgranater-och-laxkanoner-till-coronakris-har-ar-nyhetsaret-2020

SVT Nyheter. (2023a). Lärare riktar kritik mot Skolverket, ser risk för AI-fusk: nationella proven borde stoppats. Retrieved May 15, 2024, from https://www.svt.se/nyheter/lokalt/vastmanland/larare-riktar-kritik-mot-skolverket-risk-for-ai-fusk

SVT Nyheter. (2023b). Ny rapport: Många elever använder AI för att fuska i skolan. Retrieved May 15, 2024, from https://www.svt.se/nyheter/inrikes/ny-rapport-manga-elever-anvander-ai-for-att-fuska-i-skolan

The Guardian. (2023).
How AI chatbots like ChatGPT or Bard work - visual explainer. Retrieved May 16, 2024, from https://www.theguardian.com/technology/ng-interactive/2023/nov/01/how-ai-chatbots-like-chatgpt-or-bard-work-visual-explainer

Thompson, A. (n.d.). All the news 2.0 - 2.7 million news articles and essays from 27 American publications. https://components.one/datasets/all-the-news-2-news-articles-dataset

Weichert, J., & Dimobi, C. (2024). DUPE: Detection undermining via prompt engineering for deepfake text. arXiv preprint arXiv:2404.11408.

A The Swedish pre-study

In this appendix, the research from the so-called Swedish pre-study is presented. On page II, the part depicting the articles written by humans at SVT Nyheter starts. On page VI, the part depicting the texts created by AI through ChatGPT 3.5 starts.

The articles written by humans at SVT Nyheter are presented with the following information:
• The order of the article in the sequence.
• The index for the article (H1-H16), where H stands for Human.
• The date when the article was published.
• The latest date when the article was updated.
• The date when the article was retrieved.
• The source of the article.

The texts created by AI through ChatGPT 3.5 are presented with the following information:
• The order of the text in the sequence.
• The index for the text (A1-A16), where A stands for Artificial Intelligence.
• The index of the article written by a human (H1-H16) that the text is pooled to.
• The date when the AI was prompted and the text output was retrieved.
• The prompt that was put into the AI.
• The actual text.

Lastly, on page XVIII, the order of the 32 data points in the collated data set is presented, and on page XIX, the classifications made by the different AI algorithms on the 32 data points are presented.

Below are the articles written by humans at SVT Nyheter, presented in plain text as they were put into the AI algorithms.
H1: First article written by a human
Index: H1   Published: 2020-08-18   Updated: 2020-08-18   Retrieved: 2024-02-22
Source: https://www.svt.se/nyheter/lokalt/varmland/har-sprangs-misstankta-granaten-i-karlstad

H2: Second article written by a human
Index: H2   Published: 2020-09-05   Updated: 2020-09-07   Retrieved: 2024-02-22
Source: https://www.svt.se/nyheter/lokalt/varmland/misstankt-farligt-foremal-vid-polishuset-i-kristinehamn

H3: Third article written by a human
Index: H3   Published: 2020-11-05   Updated: 2020-11-20   Retrieved: 2024-02-22
Source: https://www.svt.se/nyheter/lokalt/varmland/personalbrist-pa-forlossningen-trots-standiga-rekryteringsforsok

H4: Fourth article written by a human
Index: H4   Published: 2020-08-14   Updated: 2020-08-19   Retrieved: 2024-02-22
Source: https://www.svt.se/nyheter/lokalt/varmland/kalle-91-jag-flyttar-inte-da-far-de-kora-mig-harifran-i-en-likkista

H5: Fifth article written by a human
Index: H5   Published: 2020-09-24   Updated: 2020-09-24   Retrieved: 2024-02-22
Source: https://www.svt.se/nyheter/lokalt/varmland/granbarkborren-orsakade-skador-for-over-sju-miljarder
H6: Sixth article written by a human
Index: H6   Published: 2020-09-07   Updated: 2020-09-08   Retrieved: 2024-02-22
Source: https://www.svt.se/nyheter/lokalt/varmland/carolas-son-ett-av-narkotikans-offer

H7: Seventh article written by a human
Index: H7   Published: 2020-12-15   Updated: 2024-02-12   Retrieved: 2024-02-22
Source: https://www.svt.se/nyheter/lokalt/varmland/klart-svenska-rallyt-stalls-in

H8: Eighth article written by a human
Index: H8   Published: 2020-10-06   Updated: 2020-10-06   Retrieved: 2024-02-22
Source: https://www.svt.se/nyheter/lokalt/varmland/sveriges-forsta-laxkanon-i-drift-i-forshaga

H9: Ninth article written by a human
Index: H9   Published: 2020-03-15   Updated: 2020-03-15   Retrieved: 2024-02-22
Source: https://www.svt.se/nyheter/lokalt/varmland/smittskyddslakaren-viktigt-att-begransa-hur-vi-umgas

H10: Tenth article written by a human
Index: H10   Published: 2020-04-21   Updated: 2020-04-22   Retrieved: 2024-02-22
Source: https://www.svt.se/nyheter/lokalt/varmland/storforspolitiker-akte-till-stockholm-for-att-fa-corona

H11: Eleventh article written by a human
Index: H11   Published: 2020-03-16   Updated: 2020-03-23   Retrieved: 2024-02-22
Source: https://www.svt.se/nyheter/lokalt/varmland/det-ar-nastan-varre-an-under-kriget

H12: Twelfth article written by a human
Index: H12   Published: 2020-04-08   Updated: 2020-04-08   Retrieved: 2024-02-22
Source: https://www.svt.se/nyheter/lokalt/varmland/inga-norrman-i-varmland-i-pask-tomt-i-stugor-och-husvagnar

H13: Thirteenth article written by a human
Index: H13   Published: 2020-04-15   Updated: 2020-04-15   Retrieved: 2024-02-22
Source: https://www.svt.se/nyheter/lokalt/varmland/mot-personalen-som-vardar-coronapatienter

H14: Fourteenth article written by a human
Index: H14   Published: 2020-09-15   Updated: 2020-09-15   Retrieved: 2024-02-22
Source: https://www.svt.se/nyheter/lokalt/varmland/isoleringen-borjar-kannas-trostlos-for-vara-aldre

H15: Fifteenth article written by a human
Index: H15   Published: 2020-06-05   Updated: 2020-06-05   Retrieved: 2024-02-22
Source: https://www.svt.se/nyheter/lokalt/varmland/lang-vag-till-aterhamtning-for-johan-efter-covid-19

H16: Sixteenth article written by a human
Index: H16   Published: 2020-12-06   Updated: 2020-12-07   Retrieved: 2024-02-22
Source: https://www.svt.se/nyheter/lokalt/varmland/efter-avgorande-matchen-gladje-lattnad-det-ar-allting

Below are the texts created by AI through ChatGPT 3.5, presented in plain text as they were put into the AI algorithms.

A1: First text created by an AI
Index: A1   Pooled to: H1   Retrieved: 2024-02-23
Prompt: Skriv en kort nyhetsartikel med rubriken ”Här sprängs misstänkta granaten i Karlstad”.

Här sprängs misstänkta granaten i Karlstad

Under tidiga morgontimmar idag inträffade en kraftig explosion i centrala Karlstad, vilket sände chockvågor genom det lugna samhället. Den misstänkta explosionen, som polisen bekräftar vara relaterad till en granat, skedde vid en park i närheten av ett bostadsområde.

Polisen uppger att ingen skadades i explosionen, men skador på omkringliggande egendom har noterats. Räddningstjänsten och polisen arbetar nu på platsen för att säkra området och utreda händelsen.

En talesperson för polisen uppmanar allmänheten att vara försiktig och rapportera alla misstänkta föremål till myndigheterna omedelbart. Det finns ännu inga uppgifter om motivet bakom explosionen, men en utredning är nu igång för att fastställa omständigheterna kring händelsen.
Invånarna i Karlstad uppmanas att vara vaksamma och samarbeta med myndigheterna för att säkerställa tryggheten i staden. Polisen kommer att fortsätta sin närvaro i området under dagen medan utredningen pågår.

A2: Second text created by an AI
Index: A2   Pooled to: H2   Retrieved: 2024-02-23
Prompt: Skriv en kort nyhetsartikel med rubriken ”Misstänkt farligt föremål vid polishuset i Kristinehamn”.

Misstänkt farligt föremål vid polishuset i Kristinehamn

Under tisdagsmorgonen upptäcktes ett misstänkt farligt föremål utanför polishuset i Kristinehamn. Polisen har inlett en utredning och avspärrningar har satts upp runt området för att säkerställa allmänhetens säkerhet.

Enligt initiala rapporter ska föremålet ha påträffats av en polispatrull som genast vidtog säkerhetsåtgärder och larmade specialister för att undersöka det misstänkta föremålet närmare.

Detaljer kring föremålets karaktär eller eventuellt motiv är ännu inte tillgängliga då polisen fortsätter med utredningen. Invånare uppmanas att undvika området kring polishuset tills situationen är under kontroll.

Polisen betonar vikten av allmänhetens samarbete och uppmanar personer som har information om händelsen att kontakta dem omedelbart.

Fler uppdateringar förväntas ges allteftersom utredningen fortskrider och situationen utvecklas.

A3: Third text created by an AI
Index: A3   Pooled to: H3   Retrieved: 2024-02-23
Prompt: Skriv en kort nyhetsartikel med rubriken ”Personalbrist på förlossningen trots ständiga rekryteringsförsök”.

Personalbrist på förlossningen trots ständiga rekryteringsförsök

Trots ihärdiga rekryteringsinsatser kämpar förlossningsavdelningar runt om i landet med en akut brist på personal. Den oroande situationen har fått personalstyrkan att nå kritiska nivåer, vilket innebär ökad arbetsbelastning för befintlig personal och försämrade vårdstandarder för förlossande kvinnor.

Enligt ledningen för flera förlossningsavdelningar har rekryteringsprocesserna varit kontinuerliga och omfattande. Trots detta har det varit svårt att fylla vakanserna. En av anledningarna till den fortsatta bristen på personal tros vara den höga arbetsbelastningen och de stressfyllda arbetsförhållandena som förknippas med yrket.

”Vi har gjort allt vi kan för att locka till oss kvalificerad personal, inklusive annonsering, erbjudanden om bonusar och utbildningsincitament. Tyvärr har vi ännu inte kunnat täcka behoven”, säger Anna Andersson, sjukhuschef vid en av förlossningsavdelningarna.

Konsekve