AI-Wolf in Sheep's Clothing
Distinguishing between Swedish humans and AI wannabes
Master's thesis in Computer Science and Engineering
ADAM LANDBERG
Department of Computer Science and Engineering
CHALMERS UNIVERSITY OF TECHNOLOGY
UNIVERSITY OF GOTHENBURG
Gothenburg, Sweden 2024

Master's Thesis 2024
© ADAM LANDBERG, 2024.
Supervisor: Paweł W. Woźniak, Department of Computer Science and Engineering
Advisor: Kristina Knaving, RISE Research Institutes of Sweden
Examiner: Marina Axelson-Fisk, Department of Mathematical Sciences
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000
Cover: An AI-Wolf in Sheep's Clothing
Gothenburg, Sweden 2024

Abstract

This report investigates whether it is possible for Artificial Intelligence (AI) chatbots and AI detectors to detect texts created by AI in Swedish and in English. Focusing mainly on texts created by the AI chatbot ChatGPT 4, the performance of the AI detectors Smodin and Copyleaks is investigated. The research is motivated by the scarcity of previous research on AI in Swedish and an articulated need from Swedish schools to understand whether it is possible to detect AI-created content in homework and essays. By prompting the AI detectors with 400 articles written by humans in Swedish, 400 articles written by humans in English, 400 texts created by AI in Swedish, and 400 texts created by AI in English, the AI detectors were thoroughly examined. Smodin showed an accuracy between 73.61% and 73.89% at a 99% confidence level on Swedish content, and an accuracy between 94.68% and 94.82% at a 99% confidence level on English content. Copyleaks, on the other hand, showed an accuracy between 94.80% and 94.95% at a 99% confidence level on Swedish content, and an accuracy of 100% at a 99% confidence level on English content. This indicates that, while always successfully detecting texts written by humans, it is possible to detect Swedish text created by AI nine times out of ten.

Keywords: Artificial Intelligence, AI, AI chatbots, AI detectors

Acknowledgements

Firstly, I would like to express my warmest thank you to my fantastic, free-thinking advisor Kristina Knaving for supporting me throughout the whole work. Her way of encouraging me and believing in me was exactly what I needed for this to work out. To be given a bun and a coffee, or a warming hug when it was needed, is what made the journey to finalise this work an easy ride. I am also thankful for being a part of your curious thinking, and I have learned much from your reflections. With that said: Thank you!

Secondly, I would like to express my greatest thank you to my fascinating supervisor Paweł W. Woźniak for being there as my help in the work of making this research come true. I remember the time we were discussing solutions on how to gather the data sets of the main study, and how you suggested to structure the work.
As it goes: Thank you! Finally, I would like to express my most humble thank you to my solution oriented examiner Marina Axelson-Fisk for helping me when things were unclear. I am espe- cially thankful for how you helped me with the calculation of the confidence intervals that can be found in section 3.2.2 Confidence intervals for the classifications. So to you: Thank you! I would also like to express my gratitude to the many more people that have reflected with me, that have held me up, that have pushed me to my limit, and, of course, celebrated with me! So to all of you: Thank you! Ingen nämnd - ingen glömd. Kärlek och respekt, Adam Landberg, Gothenburg, 2024-06-19 vii AI declaration The author confirms that AI was not used for the present report other than where so is stated. Adam Landberg, Gothenburg, 2024-06-19 ix Contents List of Figures xiii List of Tables xv 1 Introduction 1 1.1 The problem statement . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 What is generative AI? . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.3 How is generative AI used? . . . . . . . . . . . . . . . . . . . . . . . . 3 1.3.1 Problematic usage . . . . . . . . . . . . . . . . . . . . . . . . 3 1.3.2 Counteracting problematic usage . . . . . . . . . . . . . . . . 4 1.4 The Swedish setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 1.5 Purpose of the study . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2 Method 9 2.1 The pre-study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.1.1 The Swedish part of the pre-study . . . . . . . . . . . . . . . . 10 2.1.1.1 The performance of the AI chatbots in Swedish . . . 10 2.1.1.2 The performance of the AI detector in Swedsih . . . 10 2.1.1.3 The Swedish data of the pre-study . . . . . . . . . . 11 2.1.2 The English part of the pre-study . . . . . . . . . . . . . . . . 11 2.1.2.1 The performance of the AI chatbots in English . . . 12 2.1.2.2 The performance of the AI detector in English . . . . 12 2.1.2.3 The English data of the pre-study . . . . . . . . . . . 13 2.2 The main study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.2.1 The Swedish data of the main study . . . . . . . . . . . . . . 14 2.2.2 The English data of the main study . . . . . . . . . . . . . . . 15 3 Results 17 3.1 The pre-study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.1.1 The Swedish part of the pre-study . . . . . . . . . . . . . . . . 17 3.1.1.1 The performance of the AI chatbots in Swedish . . . 17 3.1.1.2 The performance of the AI detector in Swedish . . . 18 3.1.2 The English part of the pre-study . . . . . . . . . . . . . . . . 19 3.1.2.1 The performance of the AI chatbots in English . . . 20 3.1.2.2 The performance of the AI detector in English . . . . 21 3.2 The main study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 xi Contents 3.2.1 The classifications . . . . . . . . . . . . . . . . . . . . . . . . . 23 3.2.2 Confidence intervals for the classifications . . . . . . . . . . . 24 4 Discussion 25 4.1 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 4.1.1 The pre-study . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 4.1.2 The main study . . . . . . . . . . . . . . . . . . . . . . . . . . 27 5 Conclusion 29 5.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 5.2 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 5.3 Ethics . . . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . . . . 30 5.4 Further studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 Bibliography 33 A The Swedish pre-study I B The English pre-study XXI C The Swedish main study XLIII D The English main study DLXXI E The summarised result from the main study MCCVII xii List of Figures 3.1 GPTZero probability span when classifying Swedish texts in the pre- study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 3.2 Box plot of GPTZero probability span when classifying Swedish texts in the pre-study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 3.3 GPTZero probability span when classifying English texts in the pre-study 21 3.4 Box plot of GPTZero probability span when classifying English texts in the pre-study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 xiii List of Figures xiv List of Tables 3.1 Confusion matrix of classifications made by Copilot (Copilot (Mi- crosoft BING) when classifying Swedish texts in the pre-study . . . . . 17 3.2 Confusion matrix of classifications made by ChatGPT 3.5 when clas- sifying Swedish texts in the pre-study . . . . . . . . . . . . . . . . . . 18 3.3 Confusion matrix of classifications made by GPT-SW3 (gpt-swe-20b- instruct) when classifying Swedish texts in the pre-study . . . . . . . . 18 3.4 Confusion matrix of classifications made by GPTZero when classify- ing Swedish texts in the pre-study . . . . . . . . . . . . . . . . . . . . 18 3.5 Confusion matrix of classifications made by Copilot (Copilot (Mi- crosoft BING) when classifying English texts in the pre-study . . . . . 20 3.6 Confusion matrix of classifications made by ChatGPT 3.5 when clas- sifying English texts in the pre-study . . . . . . . . . . . . . . . . . . 20 3.7 Confusion matrix of classifications made by GPT-SW3 (gpt-swe-20b- instruct) when classifying English texts in the pre-study . . . . . . . . 20 3.8 Confusion matrix of classifications made by GPTZero when classify- ing English texts in the pre-study . . . . . . . . . . . . . . . . . . . . 21 3.9 Classification by Smodin on Swedish texts in the main study . . . . . 23 3.10 Classification by Smodin on English texts in the main study . . . . . . 23 3.11 Classification by Copyleaks on Swedish texts in the main study . . . . 23 3.12 Classification by Copyleaks on the English texts in the main study . . 23 3.13 Confidence intervals for the accuracies of the different classifications made in the main study clustered in groups of Swedish and English (as presented in the confusion matrices). . . . . . . . . . . . . . . . . 24 3.14 Confidence intervals for the accuracies of the different classifications made in the main study clustered in groups of human written articles and AI created texts. . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 xv List of Tables xvi 1 Introduction 1.1 The problem statement The 30th of November 2022 marks the date when the AI chatbot ChatGPT was introduced on the internet. This happening sparked a fire all over the world because of its ability to respond like a human. Ever since that day, more and more AI chatbots have popped up on the internet, some better performing than others. Like the corona pandemic, this widespread availability of AI chatbots has dramatically shifted the way humans live their everyday life. One can say, that as easy it was for humans to adapt to a remote way of working during the corona pandemic, as easy it has been to use and benefit from AI chatbots. 
As the availability of AI chatbots increase, the amount of texts created by AI chat- bots will increase too. As the performance of AI chatbots improves, it will get more and more difficult to distinguish texts created by AI from texts written by humans. And, as the usage of AI chatbots intensifies, the everyday human might start to think that texts created by AI actually are written by humans. The phenomenon of mistaking texts created by AI for texts written by humans, was studied by Hitsuwari et al. In 2023 they published a report named ”Does human-AI collaboration lead to more creative art? Aesthetic evaluation of human-made and AI-generated haiku poetry” in which they studied if it was possible for humans to distinguish between human made and AI made haiku (a Japanese poem of seventeen syllables). As to confirm the statement in the previous paragraph, Hitsuwari et al. could prove that it was impossible for humans to distinguish between haikus created by AI and haikus written by humans (Hitsuwari et al., 2023). The human inability to distinguish between longer texts created by AI and longer texts written by humans, had previously been studied by Partadiredja et al. (2020). They had found out that the ability of people to distinguishing between human made texts and AI-made text was 54%. In other words, the ability of humans to distinguish text created by AI from texts written by humans was proven not to be any better than the flipping of a coin. With the previously presented information at hand, one might ask if it is possible to develop an AI that could aid in the work of doing such a classification that humans fail to do. To rephrase, if humans fail to distinguish between texts created by AI and texts written by humans, could an AI do the job instead? 1 1. Introduction 1.2 What is generative AI? To start off, it is necessary to understand what is meant by an AI being able to do the job. In this case, what will be referred to as AI is what otherwise is called generative AI. It is important to understand that generative AI is not a digital copy of a brain, even though it might sometimes look like one. In the article ”How AI chatbots like ChatGPT or Bard work visual explainer” (The Guardian, 2023), the function of generative AI is explained as being inspired by how human brains work, with emphasis on inspired. Since it is fundamental to understand the function of generative AI to be able to follow the reasoning in this thesis, an explanation will follow below. The kind of generative AI that produce texts are usually referred to as Large Lan- guage Models (LLM ). These LLMs consists of connections and nodes that connect words, or parts of words, with each other. These connections between different nodes have different distances. Those distances are fundamental for the LLM because, as the distance between two nodes becomes smaller the probability of one of the nodes following the other one increases. (The Guardian, 2023) More technically speaking, each node (i.e. word or part of a word) is saved in the model as a token, and the distance between different nodes are saved as vectors. In this way, by using the tokens and the vectors, the nodes can be mapped in a multidimensional space. In this space, the distance between different nodes are what makes it possible for the LLM to predict the next node (i.e. word or part of a word) in a sequence. 
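To make the idea of tokens, vectors and distances concrete, consider the following minimal sketch. It is purely illustrative: the tiny vocabulary, the three-dimensional vectors and the context vector are invented for the example, and real models use far larger vocabularies, much higher-dimensional learned embeddings and attention over the whole context rather than a single distance comparison.

```python
import numpy as np

# Toy vocabulary with made-up 3-dimensional vectors (real models use tens of
# thousands of tokens and vectors with thousands of dimensions).
vocab = {
    "fork":  np.array([0.9, 0.1, 0.0]),
    "pesto": np.array([0.1, 0.9, 0.0]),
    "car":   np.array([0.0, 0.0, 1.0]),
}

def next_token_probabilities(context_vector):
    """Turn distances between the context and each token into probabilities:
    the closer a token's vector is to the context, the more likely it is."""
    distances = {tok: np.linalg.norm(vec - context_vector) for tok, vec in vocab.items()}
    # Smaller distance -> larger score; a softmax turns scores into probabilities.
    scores = np.array([-d for d in distances.values()])
    probs = np.exp(scores) / np.exp(scores).sum()
    return dict(zip(distances.keys(), probs))

# An invented context vector leaning towards "tidy ... eat pasta with".
context = np.array([0.8, 0.2, 0.0])
print(next_token_probabilities(context))   # "fork" gets the highest probability
```

The mechanisms in production LLMs are considerably more elaborate, but the core idea of turning positions in a vector space into probabilities for the next token is the one described in the visual explainer.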
(The Guardian, 2023) With what has been previously stated, it is appropriate to make it clear that an LLM uses mathematical processes to predict the next token in a string of text. However, the LLM does not only look at the last node in the relevant string to predict the next one. Instead, the LLM uses other nodes in the string to understand the context. To explain what that means, it is time to look at an example taken from The Guardian (2023): Let us consider the following situation where you are asking the LLM two questions: Question: What is a tidy thing to eat pasta with? Answer: A tidy thing to eat pasta with is a fork. and Question: What is a nice thing to eat pasta with? Answer: A nice thing to eat pasta with is pesto. This example has been taken from The Guardian (2023) Notice how the LLM model recognizes the differences in the sentences even though the words that make the difference were not the last words in any of the text strings! This is because the LLM takes all words it recognizes in the question text string, weights them with each other, and put certain emphasis on the words eat, pasta, tidy, and nice to make the answers appropriate. In the same way, the word with will be reweighed depending on the question asked. For the question containing the 2 1. Introduction word tidy the word with will most likely be associated with words like ”using” and ”by means of”, whilst for the question containing the word nice the word with will most likely be associated with words like ”accompanied by”. (The Guardian, 2023) With this said, the important takeaway is to understand that the LLM does not understand a text as a human do, instead it is a mathematical model that predicts the most likely next word in a sequence of words or sentences. 1.3 How is generative AI used? With the LLM explained, it is necessary to understand how generative AI is used today. One of the most well known generative AI is called ChatGPT, which is an AI chatbot that was released in November 2022. In January 2023, Reuters (2023) reported that ChatGPT had been used by 100 million users that certain month only. If that was not enough, in the same month, ChatGPT had had 13 million unique users per day (Reuters, 2023). This increased usage of AI chatbots had an effect in many parts of society, including academia. In the same reporting as was previously mentioned, Reuters made an indicating statement presenting ”questions about facilitation of academic dishonesty and mis- information” (Reuters, 2023). The reporting from Reuters was not the only one introducing question about AI and academia. On the 10th of January 2023, Forbes staff Bushard (2023) reported that ”[f]ake scientific abstracts and research papers generated using OpenAIs highly-advanced chatbox ChatGPT fooled scientists into thinking they were real reports nearly one-third of the time, [...]”. Half a year after Bushard’s reporting, Bisi et al. (2023) presented a study on how much AI created content 425 articles, published before and after the launch of ChatGPT, contained. The research ”[...] showed an increase in AI generation for complete articles [...] and for the body of articles [...]”. The authors also argue that AI content had been present already before the wide spread of ChatGPT, but their findings could how- ever confirm that the presence of AI content had significantly increased over time nonetheless. (Bisi et al., 2023) 1.3.1 Problematic usage As earlier presented , the usage of generative AI is high and widely spread in society. 
However, what has not yet been covered is the potential impact the availability of AI has had on the society. Therefore, it is time to look at an article named ”The World of Generative AI: Deepfakes and Large Language Models” by Mitra et al. (2024). In this article, the authors are covering the harmful ways in which LLMs can be used. More specifically, the authors are investigating the topic of deepfakes as they are made possible through the use of LLMs. The authors state that the mere existence of deepfakes is not problematic, however they claim that, as it gets attention by the global hacking community, the risk of AI-crimes increasing significantly seems as given, rather than only a possibility. In other words, it should be expected that AI-crimes like theft, fraud, forgery and personation (King et al., 2020) will increase as a result of the wide spread and wide usage of generative AI. 3 1. Introduction Another aspect of problematic usage of LLMs, is the spreading of fake news. This phenomenon was studied by Sun et al. (2024) who tested real news articles against fake news articles on humans and AI detecting algorithms. More specifically, they tested fake news, created by prompting different AI algorithms with statements ts create fake news, against real news articles written by humans. What the authors found was that the fake news articles were difficult to distinguish from the real news articles. Hence, the authors could claim that, for both humans and AI, it was difficult to distinguish between human written articles and AI created texts (Sun et al., 2024). What that means is that, when humans cannot distinguish fake news from real news, and when AI cannot help humans make that distinguishment, the risk of fake news spreading as facts increases. One more article looking into the topic of detecting fake information is ”Disinforma- tion Detection: An Evolving Challenge in the Age of LLMs” by Jiang et al. (2024). In that article, the researchers studied the question of how current state-of-the-art detection models can detect LLM-generated disinformation. Firstly, they concluded that the current state-of-the-art detection models could not detect LLM-generated disinformation. Therefore, they continued their research with asking if LLMs could be used instead. The authors could conclude that the technique using LLMs also faced challenges and was therefore neither seen as a sufficient solution (Jiang et al., 2024). A concrete example of a study trying to detect fake information in an everyday setting was presented by Ayoobi et al. (2023) who studied LinkedIn profiles gener- ated by LLM. One of the main reasons the authors studied this phenomenon was because these fake accounts rapidly could access private and sensitive information from human users on the platform. Also, as a fake account stays active, it gains more and more credibility leading to an increased risk of it engaging in phishing and other kinds of scamming. This phenomenon was also covered by Beyond Identity (2023) which reported that ”39% of people said they would fall victim to at least one phishing scam generated by ChatGPT.” In other words, fake information on social media platforms lead to an increased level of AI-crimes. (Ayoobi et al., 2023) 1.3.2 Counteracting problematic usage What seems to be one solution to mitigating the problem with the spreading of fake information, is the need to recognise AI created content. In the previously mentioned research made by Ayoobi et al. 
(2023), the authors presented a possibility to mitigate the problems with fake accounts created using LLM. What they could prove was that, by using a certain developed algorithm it was possible to detect profiles created by LLM at an accuracy of at least 90% in English. However, Liang et al. (2024) could show that ”[...] roughly 7-15% of sentences in ML conference reviews were substantially modified by AI beyond a simple grammar check [...]”. Adding to that, the authors also showed that the words commenadable, innovative, meticulous, intricate, notable, and versatile had increased significantly in the ICLR 2024 (International Conference on Learning Representations 2024) peer reviews compared to the years before, indicating an increased level of AI usage in 4 1. Introduction these articles. What this confirms, is that the use of LLM is wide spread and has, to this day, been used in scientific writings without being detected. Thus, whilst Ayoobi et al. (2023) confirm that it is possible to detect AI content on social media accounts, Liang et al. (2024) claims that doing so in academic writing seems rather difficult. As a consequence of these problems, so called AI detectors have popped up over the internet. Those AI detectors claim to be able to detect AI created texts. In a preliminary study, made by Habibzadeh (2023), one of these AI detectors named GPTZero was tested. In that study 30 English texts written by humans and 20 English texts created by AI was put into the GPTZero for classification. The result was a accuracy of 80%, a specificity of 90%, and a sensitivity of 65%. That could be perceived as a good performance since it seems to not be guessing the results, however the accuracy is too low for GPTZero to be used in a reliable way in, for example, a school setting (Habibzadeh, 2023). Another research looking into the performance of AI detectors was performed by Chaka (2023) in a study looking into AI generated texts from the different LLMs and how well five different AI detectors could detect them. What they could conclude, was a confirmation of the findings made by (Habibzadeh, 2023) and they stated: ”[...] it is evident from this paper that all five AI content detectors seem not yet fully ready to accurately and convincingly detect AI-generated content from machine- generated texts in different contexts. This has dire consequences for AI-generated plagiarism in academic essay writing.” Source: Chaka (2023) To understand the inability of AI detectors to accurately detect AI created content, it is necessary to understand the functionality of AI detectors. Ma et al. (2023) presented a reasonable exposition on how AI tools could work by focusing on the use of perplexity as a tool when developing and employing the relevant AI. What is perplexity? When using perplexity in relation to AI, and more specifically Natural Language Pro- cessing (NLP), it refers to the process of guessing the next word in a text string. The better the AI is at guessing the next word, the lower the perplexity score gets. When an AI language model is developed, it is usually designed to produce texts with low perplexity. On the other hand side, when humans produce texts these often show of higher perplexity. Thereby, perplexity can be used as a tool for distinguishing between texts created by AI (which should show of lower levels of perplexity) and texts written by humans (which should show of higher levels of perplexity). Source: Stephen M. Walker II (n.d.) 5 1. Introduction Ma et al. 
(2023) could claim that the AI detector makes the distinguishment better than what a human do, but they were unsure about the future performance. They claimed that the AI detector will become worse and worse at doing this job as the available AI language models develop and become more advanced. Put differently, as AI language models develops they will most probably imitate humans more and more, consequently leading to an increased level of perplexity also in texts created by AI. Thereby, Ma et al. (2023) instead claim that the future focus of detecting AI created content should focus on semantics (the meaning of words and sentences) and pragamatics (the meaning of phrases in different contexts) rather than perplexity. This finding confirms what was also found by Jiang et al. (2024) when they claimed that LLMs were not successful in the work of detecting AI content, and instead having to recognise contextual elements to do so. This need for shift in focus regarding how AI detectors should work, is insinuated by Elkhatat et al. (2023) in their article ”Evaluating the efficacy of AI content detection tools in differentiating between human and AI-generated text”. The authors agree with Ma et al. (2023) regarding that it is indeed possible for AI detectors to recognise AI created content, but they emphasise the need to be mindful about the capabilities of generative AI as the AI models advance. This claim was supported by their findings about how AI detectors performed inconsistently and with a worse performance on detecting ChatGPT 4 content compared to ChatGPT 3.5 content (Elkhatat et al., 2023), which also confirms the conclusion made by Ma et al. (2023). With this said, it comes clear that the work of classifying AI texts is a job that is in constant need of development. One of these kinds of developments was suggested by Cingillioglu (2023) who could show that, by using a Support Vector Machine (SVM), the work of identifying human written essays could reach an accuracy of 100%. However, this came at a cost of an overall accuracy of the model of 92.7% indicating a lower accuracy for detecting AI generated texts. The impaired ability to detect AI content was though perceived as acceptable because of a zero tolerance of any human written essays being classified as created by AI. Another perspective that sometimes is claimed as being useful in the work of de- tecting AI created content is detecting hidden patterns in texts, which usually is referred to as watermarking. One of these available watermarking techniques was presented by Kirchenbauer, Geiping, Wen, Katz, et al. (2023). Their technique can be described as focused on the number of occurrences of certain preferred words in a text string. When a text string is examined and the limit of preferred words has been exceeded, the text string should be perceived as generated by an AI. Their proposed solution is later strengthened by findings by Kirchenbauer, Geiping, Wen, Shu, et al. (2023) (note: three of the authors in this article is the same as in the previously mentioned article) and they claim: ”We advocate for a view of watermarking reliability as a function of text length, and find that even human writers cannot reliably remove watermarks if being measured at 1000 words, despite having the goal of removing the watermark.” Source: Kirchenbauer, Geiping, Wen, Shu, et al. (2023) 6 1. Introduction This technique was later criticised by Weichert and Dimobi (2024), who argue that the assumptions to the solution does not hold. 
They instead claim that texts writ- ten by humans are not as random as Kirchenbauer, Geiping, Wen, Shu, et al. (2023) assert, and instead mean that human writing does not have a uniformly distributed frequency of word usage. From that statement, Weichert and Dimobi (2024) could show that ”[...] the watermarking algorithm does not accurately model human lan- guage by assuming that tokens in human text are uniformly distributed” and they propose further studies on the reliability of the watermarking technique. What all this stated, it seems like there exist solutions on how to recognise AI content on the market. However, as the performances of these AI detectors are claimed to be on a certain level, there is nothing really proving so is actually the case. 1.4 The Swedish setting In previous sections, the increasing presence of AI in the everyday human’s life, and the lack of proper detection tools for detecting texts created by AI, has been presented. Worth noticing is that the previously presented studies have all been performed in English, meaning there is much less to conclude for languages other than English. As for Swedish, (which for the record is a much less wide spread language when compared to English) it is hard to find any significant studies made on AI detectors and their performance. What could be said about AI and Swedish is though that AI chatbots, like ChatGPT 3.5, is already capable of creating Swedish texts. Thus, an ability to detect texts created by AI in Swedish becomes important. As shown before, plenty of studies have claimed that AI detectors can detect English texts created by AI, but that the AI detectors are imperfect. This raises questions about how these AI detectors could be performing on Swedish content. There could be a situation where the differences in languages between Swedish and English would cause the AI detectors to perform better on Swedish than on English; worse on Swedish than on English; or the same on both Swedish and on English.∗ Detecting AI content is a growing interest. One of the reasons for this is because of the fast rate in which AI can produce content (both true and false) and, consequently, how fast this produced content can be spread. Another reason is the need to know whether something is written by a certain person, for example in school. On the 30th of August 2023 the Swedish public service television broadcaster Sveriges Television (SVT ) reported that: ”More than half of the young people in a recent survey from ”Ungdomsbarometern” answer that they have, at some point, used AI for a school assignment in a way that they themselves are not sure is allowed.” (free translation). Source: SVT Nyheter (2023b) Already in April 2023, SVT had reported on the matter of pupils using AI to cheat in school. That certain reporting was regarding the national exams that could ∗What is mentioned as ”to perform” is referring to the accuracy of the different AI detectors. 7 1. Introduction be prepared at home, and how AI could be used to cheat on those examinations. A certain teacher that was interviewed in the article claimed that all submissions should be completely banned (SVT Nyheter, 2023a). Consequently, the Swedish National Agency for Education, Skolverket, published information on their website about AI in school in the beginning of September 2023. In that information Skolverket stated that: ”There are currently no reliable tools to determine whether a text is written by a chatbot. 
The Swedish National Agency for Education advises against submitting as- signments as a basis for grades if you, as a teacher, cannot ensure the reliability of the content.” (free translation). Source: Skolverket (2023) As shown, it is difficult for both pupils and teachers to orientate around the topic of AI. There still seems to exist a belief that it is possible to ask an AI if a text was written by an AI or a human, and to believe to be given a trustworthy answer. This is today unclear, especially in a Swedish setting, and research is therefore needed in the area. 1.5 Purpose of the study With all this stated, it is interesting to study whether AI-models could be better performing than humans and thereby help humans make a more accurate prediction of whether a text in Swedish is written by a human or created by an AI. That is to say: Could AI do the job of distinguishing between texts created by AI and texts written by humans in Swedish? The framing of this study is as follows: Three AI chatbots and three AI detectors will be tested on texts created by AI and texts (more specifically: articles) written by humans, in both English and in Swedish. That is made to see how well the AI chatbots and the AI detectors perform on the different inputs. The study aims to find how well these different AI chatbots and AI detectors perform when asked to detect whether a content is made by an AI or by a human. The gaps in knowledge that this research seeks to fill is the lack of proper studies on how AI detection could be carried out in Swedish. There is a lot of studies on logic behind the function of AI detection in the English language, but there are no real studies on Swedish. More specifically, there is no studies on how well these self proclaimed AI detection tools perform in other languages than English. Adding to that the requests from Swedish schools to get help with how to work around the issue with pupils handing in homework and essays not written by themselves but rather by using an AI, strengthens the reasons for carrying out this kind of research. 8 2 Method This chapter is divided into two parts: the pre-study and the main study. Since a study like this has not been taking place in a Swedish setting before, an initial study (which will be referred to as the pre-study) was firstly undertaken. That study tested three different AI chatbots and one AI detector on their ability to detect AI created content. The data these AI chatbots and the AI detector were tested on, was a dataset of in total 16 articles written by humans in Swedish, 16 articles written by humans in English, 16 texts created by AI in Swedish, and 16 texts created by AI in English. The results from the pre-study was then used to design the succeeding main study. The main study was designed and undertaken to study two AI detectors in more depth. The data these AI detectors were tested on was a dataset of in total 400 articles written by humans in Swedish, 400 articles written by humans in English, 400 texts created by AI in Swedish, and 400 texts created by AI in English. The findings from the main study are then presented as the main findings of this research. More detailed information is presented below. 2.1 The pre-study The data referred to in the whole of section 2.1 The pre-study can be found in Appendix A (for the Swedish parts) and Appendix B (for the English parts). 
The pre-study was undertaken to make an initial attempt to map the environment of how AI classifies texts as either written by a human or created by an AI. To make this pre-study the following three different AI chatbots were chosen: • Copilot (Microsoft BING) • ChatGPT 3.5 • GPT-SW3 (gpt-swe-20b-instruct) The choice of using Copilot (Microsoft BING) and ChatGPT 3.5 was because of them being free to use and widely available. The reason for using GPT-SW3 (gpt- swe-20b-instruct) was because it is the first version of a GPT that was made specif- 9 2. Method ically to be used in Swedish and because it was offered for free. Adding to those three chatbots, a so called ”AI detector” was also chosen. The choice fell on the tool GPTZero for its wide spread in the area of this kind of classifications and because of it being free to use. 2.1.1 The Swedish part of the pre-study To try to make an initial attempt to answer the question if it is possible to distinguish between human written articles and AI created texts in Swedish, the Swedish data set was put in a random order as presented in Appendix A. The data points were then put in an orderly manner into the following three different AI chatbots: • Copilot (Microsoft BING) • ChatGPT 3.5 • GPT-SW3 (gpt-swe-20b-instruct) 2.1.1.1 The performance of the AI chatbots in Swedish When it comes to the AI chatbots, the prompt with which they all were prompted with, was: Är den här texten skriven av en AI eller en människa: [current data point from the Swedish data set] The Swedish prompt would mean this in English: Is this text written by an AI or a human: [current data point from the Swedish data set]. This was made for each data point in the Swedish data set in the order presented in Appendix A. All answers from each AI chatbot was recorded and marked as Human, AI or No classification based on the response from the AI chatbots. The classification was made like this: • If the AI chatbot responded something like ”This text appears to have been written by a human.” or in any other similar way, the classification was recorded as Human. • If the AI chatbot responded something like ”This text appears to have been written by an AI.” or in any other similar way, the classification was recorded as AI. • If the AI chatbot responded something like ”It is uncertain whether the text was written by a human or an AI.” or in any other similar way, the classification was recorded as No classification. 2.1.1.2 The performance of the AI detector in Swedsih When it comes to the AI detector, the text was loaded into the AI detector without any additional prompt and the AI detector responded with a probability of the text 10 2. Method being written by a human, an AI, or if it was a mix between them both. When sum- ming the three different probabilities they summed up to 100%. The classification was made like this: • If the AI detector responded with a probability greater than 50% of the text being written by a human, the classification was recorded as Human. • If the AI detector responded with a probability greater than 50% of the text being created by an AI, the classification was recorded as AI. • If the AI detector responded with a probability greater than 50% of the text being written by a human and an AI together, if the AI detector responded with a probability lower than 50% of the text being written by a human, or if the AI detector responded with a probability lower than 50% of the text being written by an AI, the classification was recorded as No classification. 
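Expressed as code, the rules above correspond to the following minimal sketch (the function name and the numeric input format are assumptions for illustration; in practice GPTZero was used through its web interface):

```python
def classify_pre_study(p_human: float, p_ai: float, p_mixed: float) -> str:
    """Map the detector's three probabilities (which sum to 100%) to a label,
    following the classification rules used in the pre-study."""
    if p_human > 50:
        return "Human"
    if p_ai > 50:
        return "AI"
    # Mixed above 50%, or neither Human nor AI above 50%.
    return "No classification"

print(classify_pre_study(p_human=20, p_ai=70, p_mixed=10))  # -> "AI"
```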
2.1.1.3 The Swedish data of the pre-study To test the performance of the three AI chatbots and the AI detector on Swedish content, 16 articles mentioned in the article ”Från chockgranater och laxkanoner till coronakris – här är nyhetsåret 2020” by the Swedish national public television broadcaster, SVT, were collated (SVT Nyheter, 2020). They were then ordered and given an index between H1 and H16 (H standing for the text being written by a human). Following, the headlines of every indexed article was extracted and turned into sixteen prompts that could be put into an AI chatbot. The prompts were written in Swedish like this: Skriv en kort nyhetsartikel med rubriken [headline of pooled article]. The Swedish prompt would mean this in English: Write a short news article with the headline [headline of pooled article]. The chosen AI chatbot to respond to the prompts was ChatGPT 3.5. The 16 prepared prompts were put into the AI and the responses from ChatGPT 3.5 were ordered and given an index between A1 and A16 (A standing for the text being created by an AI). With the human written texts and the AI created texts at hand, a data set of 32 data points were made consisting of 16 human written texts in Swedish (H1-H16) and 16 AI created texts in Swedish (A1-A16). 2.1.2 The English part of the pre-study To try to make an initial attempt to answer the question if it is possible to distinguish between human written articles and AI created texts in English, the English data set was put in a random order as presented in Appendix B. The data points were then put in an orderly manner into the following three different AI chatbots: • Copilot (Microsoft BING) 11 2. Method • ChatGPT 3.5 • GPT-SW3 (gpt-swe-20b-instruct) 2.1.2.1 The performance of the AI chatbots in English When it comes to the AI chatbots, the prompt with which they all were prompted with, was: Is this text written by an AI or a human: [current data point from the English data set] This was made for each data point in the English data set in the order presented in Appendix B. All answers from each AI chatbot was recorded and marked as Human, AI or No classification based on the response from the AI chatbots. The classification was made like this: • If the AI chatbot responded something like ”This text appears to have been written by a human.” or in any other similar way, the classification was recorded as Human. • If the AI chatbot responded something like ”This text appears to have been written by an AI.” or in any other similar way, the classification was recorded as AI. • If the AI chatbot responded something like ”It is uncertain whether the text was written by a human or an AI.” or in any other similar way, the classification was recorded as No classification. 2.1.2.2 The performance of the AI detector in English When it comes to the AI detector, the text was loaded into the AI detector without any additional prompt and the AI detector responded with a probability of the text being written by a human, an AI, or if it was a mix between them both. When sum- ming the three different probabilities they summed up to 100%. The classification was made like this: • If the AI detector responded with a probability greater than 50% of the text being written by a human, the classification was recorded as Human. • If the AI detector responded with a probability greater than 50% of the text being created by an AI, the classification was recorded as AI. 
• If the AI detector responded with a probability greater than 50% of the text being written by a human and an AI together, if the AI detector responded with a probability lower than 50% of the text being written by a human, or if the AI detector responded with a probability lower than 50% of the text being written by an AI, the classification was recorded as No classification. 12 2. Method 2.1.2.3 The English data of the pre-study To test the performance of the three AI chatbots and the AI detector on English content, four articles authored by the BBC (British Broadcasting Corporation), four articles authored by the RTÉ (Raidió Teilifís Éireann), four articles authored by the CNN (Cable News Network), and four articles authored by the ABC (American Broadcasting Company) were collated as 16 articles in English authored by humans. They were then ordered and given an index between h1 and h16 (h standing for the text being written by a human). Following, the headlines of every indexed article was extracted and turned into sixteen prompts that could be put into an AI chatbot. The prompts were written in English like this: Write a short news article with the headline [headline of pooled article]. The chosen AI chatbot to respond to the prompts was ChatGPT 3.5. The 16 pre- pared prompts were put into the AI and the responses from ChatGPT 3.5 were ordered and given an index between a1 and a16 (a standing for the text being created by an AI). With the human written texts and the AI created texts at hand, a data set of 32 data points were made consisting of 16 human written texts in English (h1-h16) and 16 AI created texts in English (a1-a16). 2.2 The main study The data referred to in the whole of section 2.2 The main study can be found in Appendix C (for the Swedish parts), Appendix D (for the English parts), and Appendix E (for the overall result). Based on the results from the pre-study, it came clear that there was no need to try to test the different AI chatbots to any greater extent, since they showed of bad performances. However, studying the performances of the self proclaimed AI detectors came as a more reasonable study to perform. In the main study the performance of AI detectors classifying texts as either written by a human or created by an AI were taken to a greater extent than in the pre-study. However, in the main study GPTZero was excluded and replaced by two other AI detectors, namely: • Smodin • Copyleaks The choice fell upon these two AI detectors mainly because of the reason that GPTZero had showed of really bad performance in Swedish in the pre-study. It was therefore a necessity to choose new AI detectors, that claimed to be able to detect AI content in both English and Swedish. As such, Smodin and Copyleaks were the 13 2. Method only two AI detectors that claimed to be able to make this kind of distinguishment and they were thereby chosen. The main study is an up-scaled version of the method presented by Habibzadeh (2023) in his article ”GPTZero performance in identifying artificial intelligence- generated medical texts: a preliminary study”. This meant that the Swedish and the English datasets of 800 data points each, were loaded into the AI detectors without any additional prompt. On the prompt, the AI detectors responded with a probability of the text being written by an AI. The classification was made like this: • If the AI detector responded with a probability smaller than 50% of the text being written by an AI, the classification was recorded as Human. 
• If the AI detector responded with a probability of 50% or greater of the text being created by an AI, the classification was recorded as AI.
• If the AI detector responded with a probability lower than 0% or higher than 100% of the text being written by an AI, the classification was recorded as Error.

When choosing articles to classify, articles published before 2021 were selected. If an article had been updated after the end of 2020, however, it was excluded.

2.2.1 The Swedish data of the main study

To test the performance of the two AI detectors on Swedish content, articles written by humans and texts created by AI were needed. Therefore, 400 articles were selected from the dataset News articles and front pages from 19 Swedish news sites during the covid-19/corona pandemic 2020–2021 (Dahlgren, n.d.). The selection was made as follows:

1. The references marked as both page_type: specific article and article_genre: news article were selected, and the other references were excluded.
2. Then all references that were marked as having a paywall were excluded.
3. Then all references published after 2020-12-31 were excluded.
4. Then all references updated after 2020-12-31 were excluded.
5. Then all references were given a random value between 0 and 1.
6. Then the references were sorted in ascending order.

The process then went through every reference in this order. This meant manually going through the references until 400 articles that met the needs of the study had been found. When going through the references, references to dead ends (usually because the article had been deleted) were excluded. Articles in the form of bullet lists were also excluded. In total, 877 references were checked before arriving at 400 articles, meaning that 477 references were passed over. The 400 articles were then collated into a data set of 400 data points of Swedish articles written by humans.

Added to the 400 articles written by humans were 400 texts created by AI. To get the 400 AI-created texts, the following procedure was followed:

1. Each article (data point) in the dataset of Swedish articles was put into the ChatGPT 4 AI chatbot with a prompt of this form: Sammanfatta följande artikel i en mening: [the pooled article]. (The Swedish prompt means this in English: Summarize the following article in one sentence: [the pooled article].)
2. The summarizations were collated into a dataset of, in total, 400 summarizations of the Swedish human-written articles.
3. Each one-sentence summarization (data point) in the dataset of summarizations of the Swedish human-written articles was put into the ChatGPT 4 AI chatbot with a prompt of this form: Skriv en artikel om: [the one sentence summarization of the pooled Swedish article]. (The Swedish prompt means this in English: Write an article about: [the one sentence summarization of the pooled Swedish article].)
4. Each text created by the AI was saved.

Once 400 texts created by AI had been obtained, they were collated into a data set of 400 data points of Swedish texts created by AI. The data set of 400 data points of Swedish articles written by humans was thus accompanied by the data set of 400 data points of Swedish texts created by AI, and together they constituted a merged data set of, in total, 800 data points of Swedish articles written by humans and texts created by AI.
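As a compact illustration of the selection steps above, the following sketch applies the same filtering and random ordering. The file name, column names and date format are hypothetical (the real metadata from Dahlgren, n.d. has its own schema), and the screening for dead links and bullet-list articles was done manually in the study; only the prompt templates are quoted from the procedure above.

```python
import numpy as np
import pandas as pd

# Hypothetical file and column names; illustrative only.
refs = pd.read_csv("swedish_news_references.csv", parse_dates=["published", "updated"])

cutoff = pd.Timestamp("2020-12-31")
refs = refs[
    (refs["page_type"] == "specific article")
    & (refs["article_genre"] == "news article")
    & (~refs["paywall"])
    & (refs["published"] <= cutoff)
    & (refs["updated"].isna() | (refs["updated"] <= cutoff))
]

# Steps 5-6: give every remaining reference a random value between 0 and 1 and
# sort ascending; this fixes the order in which references were then checked
# manually until 400 usable human-written articles had been found.
refs["rand"] = np.random.rand(len(refs))
refs = refs.sort_values("rand").reset_index(drop=True)

# The two ChatGPT 4 prompts used to create the matching AI texts (Swedish originals):
SUMMARISE_PROMPT = "Sammanfatta följande artikel i en mening: {article}"
EXPAND_PROMPT = "Skriv en artikel om: {summary}"
```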
2.2.2 The English data of the main study To test the performance of the two AI detectors on English content, articles written by humans and texts created by AI was needed. Therefore, 400 articles were selected from the dataset All the News 2.0 – 2.7 million news articles and essays from 27 American publications (Thompson, n.d.). The selection was made as following: 1. The references published in any other year than 2020 were excluded. 2. Then all references were given a random value between 0 and 1. 3. Then the references were sorted in ascending order. The process then started to go throguh every reference in an orderly manner. This meant manually going through the references until having found 400 articles that met the needs for the study. When going through the references, references to articles updated after 2020-12-31 were excluded. Also, references to dead ends (usually because the article had been deleted) and articles on the form of bullet lists were excluded. In total 852 references were checked before concluding in 400 articles, meaning that 452 references were passed. When having found 400 articles, they were collated into a data set of 400 data points of English articles written by humans. 15 2. Method Adding to the 400 articles written by humans were 400 pooled texts created by AI. To get 400 AI created texts, a procedure as following was followed: 1. Each article (data point) in the datset of English articles were put in the Chat- GPT 4 AI chatbot with a prompt on this form: Summarize the following article in one sentence: [the pooled article]. 2. The summarizations were collated into a dataset of in total 400 summarizations of the English human written articles. 3. Each one sentence summarization (data point) in the datset of summariza- tions of the English human written articles were put in the ChatGPT 4 AI chatbot with a prompt on this form: Write an article about: [the one sentence summarization of the pooled Swedish article]. 4. Each text created by the AI were saved. When having been given 400 texts created by AI, they were collated into a data set of 400 data points of English texts created by AI. Now, the data set of 400 data points of English articles written by humans was accompanied by the data set of 400 data points of English texts created by AI into a merged data set of, in total, 800 data points of English articles written by humans and texts created by AI. 16 3 Results This chapter is divided into two parts. First comes the part presenting the results from the pre-study, and then comes the part presenting the results from the main study. 3.1 The pre-study Below is the results from the so called pre-study presented. Starting with a section focusing on the Swedish data set, and ending with a section focusing on the English data set. 3.1.1 The Swedish part of the pre-study In this section, the result from the Swedish pre-study is presented. In the same way as Habibzadeh (2023) presented his findings, the findings from this study will be presented by the use of confusion matrices. The first part focuses on the performance of the different AI chatbots, and the second part focuses on the performance of the AI detector. 3.1.1.1 The performance of the AI chatbots in Swedish The 32 prompts, followed by the relevant Swedish data point, were put into each AI chatbot. 
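The results below are reported as confusion matrices annotated with accuracy, sensitivity and specificity. As a reference for how those figures relate to the matrix cells, here is a minimal sketch assuming the standard definitions, with AI-created text treated as the positive class and the "Neu" column counted as incorrect for both classes; the helper function is illustrative and was not part of the study.

```python
def matrix_metrics(ai_row, human_row):
    """ai_row and human_row hold (predicted AI, predicted Neu, predicted Human)
    counts for texts that actually were AI-created and human-written, respectively."""
    tp = ai_row[0]        # AI texts correctly flagged as AI
    tn = human_row[2]     # human texts correctly flagged as human
    total = sum(ai_row) + sum(human_row)
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / sum(ai_row),      # share of AI texts detected
        "specificity": tn / sum(human_row),   # share of human texts recognised
    }

# The cells of Table 3.1 (Copilot on Swedish texts) reproduce the reported values:
print(matrix_metrics(ai_row=(11, 0, 5), human_row=(6, 2, 8)))
# {'accuracy': 0.59375, 'sensitivity': 0.6875, 'specificity': 0.5}
```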
The performance of the different AI chatbots is as follows:

Table 3.1: Confusion matrix of classifications made by Copilot (Microsoft BING) when classifying Swedish texts in the pre-study

                       PREDICTION
ACTUAL            Pos (AI)   Neu   Neg (Human)
Pos (AI)             11        0        5
Neg (Human)           6        2        8
Accuracy: 59%   Sensitivity: 69%   Specificity: 50%

Table 3.2: Confusion matrix of classifications made by ChatGPT 3.5 when classifying Swedish texts in the pre-study

                       PREDICTION
ACTUAL            Pos (AI)   Neu   Neg (Human)
Pos (AI)              0       13        3
Neg (Human)           0        9        7
Accuracy: 22%   Sensitivity: 0%   Specificity: 44%

Table 3.3: Confusion matrix of classifications made by GPT-SW3 (gpt-swe-20b-instruct) when classifying Swedish texts in the pre-study

                       PREDICTION
ACTUAL            Pos (AI)   Neu   Neg (Human)
Pos (AI)              3        2       11
Neg (Human)           1        0       15
Accuracy: 56%   Sensitivity: 19%   Specificity: 94%

3.1.1.2 The performance of the AI detector in Swedish

Each Swedish data point was then put into the AI detector, resulting in a total of 32 inputs. The performance of the AI detector is as follows:

Table 3.4: Confusion matrix of classifications made by GPTZero when classifying Swedish texts in the pre-study

                       PREDICTION
ACTUAL            Pos (AI)   Neu   Neg (Human)
Pos (AI)              0        0       16
Neg (Human)           0        0       16
Accuracy: 50%   Sensitivity: 0%   Specificity: 100%

For each prediction, GPTZero presents a probability span for how much it believes a text to be entirely written by a human, entirely created by an AI, or a mix of the two. The result for the 32 data points in the data set is presented in figure 3.1.

Figure 3.1: GPTZero probability span when classifying Swedish texts in the pre-study. [Chart of the Human, Mixed and AI probabilities (0-100%) for each data point A1-A16 and H1-H16.]

As can be seen in figure 3.1, GPTZero reports a constant 0% probability of every text being created by AI. Figure 3.2 below presents a box plot showing the variance of the probabilities calculated by the GPTZero algorithm.

Figure 3.2: Box plot of GPTZero probability span when classifying Swedish texts in the pre-study. [Box plots of the Mixed and Human probabilities for H1-H16 and A1-A16.]

3.1.2 The English part of the pre-study

In this section, the result from the English pre-study is presented. In the same way as Habibzadeh (2023) presented his findings, the findings from this study are presented using confusion matrices. The first part focuses on the performance of the different AI chatbots, and the second part focuses on the performance of the AI detector.

3.1.2.1 The performance of the AI chatbots in English

The 32 prompts, followed by the relevant English data point, were put into each AI chatbot.
The performance of the different AI chatbots is as follows:

Table 3.5: Confusion matrix of classifications made by Copilot (Microsoft BING) when classifying English texts in the pre-study

                       PREDICTION
ACTUAL            Pos (AI)   Neu   Neg (Human)
Pos (AI)              0        0       16
Neg (Human)           0        0       16
Accuracy: 50%   Sensitivity: 0%   Specificity: 100%

Table 3.6: Confusion matrix of classifications made by ChatGPT 3.5 when classifying English texts in the pre-study

                       PREDICTION
ACTUAL            Pos (AI)   Neu   Neg (Human)
Pos (AI)              0       13        3
Neg (Human)           0        3       13
Accuracy: 41%   Sensitivity: 0%   Specificity: 81%

Table 3.7: Confusion matrix of classifications made by GPT-SW3 (gpt-swe-20b-instruct) when classifying English texts in the pre-study

                       PREDICTION
ACTUAL            Pos (AI)   Neu   Neg (Human)
Pos (AI)              9        1        6
Neg (Human)           5        0       11
Accuracy: 63%   Sensitivity: 56%   Specificity: 69%

3.1.2.2 The performance of the AI detector in English

Each English data point was then put into the AI detector, resulting in a total of 32 inputs. The performance of the AI detector is as follows:

Table 3.8: Confusion matrix of classifications made by GPTZero when classifying English texts in the pre-study

                       PREDICTION
ACTUAL            Pos (AI)   Neu   Neg (Human)
Pos (AI)             16        0        0
Neg (Human)           0        0       16
Accuracy: 100%   Sensitivity: 100%   Specificity: 100%

For each prediction, GPTZero presents a probability span for how much it believes a text to be entirely written by a human, entirely created by an AI, or a mix of the two. The result for the 32 data points in the data set is presented below.

Figure 3.3: GPTZero probability span when classifying English texts in the pre-study. [Chart of the Human, Mixed and AI probabilities (0-100%) for each data point a1-a16 and h1-h16.]

As can be seen in figure 3.3, GPTZero shows good confidence in each correct classification. Figure 3.4 below presents a box plot showing the variance of the probabilities calculated by the GPTZero algorithm.

Figure 3.4: Box plot of GPTZero probability span when classifying English texts in the pre-study. [Box plots of the AI, Mixed and Human probabilities for h1-h16 and a1-a16.]

3.2 The main study

Below, the results from the main study are presented. In the same way as Habibzadeh (2023) presented his findings, the results are presented using confusion matrices.

3.2.1 The classifications

The results are presented in four different confusion matrices as follows:

Table 3.9: Classification by Smodin on Swedish texts in the main study

                       PREDICTION
ACTUAL            Pos (AI)   Neu   Neg (Human)
Pos (AI)            193       12      195
Neg (Human)           2        1      397
Accuracy: 74%   Sensitivity: 48%   Specificity: 99%
13 data points were classified as "No classification".

Table 3.10: Classification by Smodin on English texts in the main study

                       PREDICTION
ACTUAL            Pos (AI)   Neu   Neg (Human)
Pos (AI)            373       18        9
Neg (Human)          14        1      385
Accuracy: 95%   Sensitivity: 93%   Specificity: 96%
19 data points were classified as "No classification".

Table 3.11: Classification by Copyleaks on Swedish texts in the main study

                       PREDICTION
ACTUAL            Pos (AI)   Neu   Neg (Human)
Pos (AI)            359        0       41
Neg (Human)           0        0      400
Accuracy: 95%   Sensitivity: 90%   Specificity: 100%
0 data points were classified as "No classification".
Table 3.12: Classification by Copyleaks on the English texts in the main study
PREDICTION:            Pos (AI)   Neu   Neg (Human)
ACTUAL Pos (AI)          400       0        0
ACTUAL Neg (Human)         0       0      400
Accuracy: 100%   Sensitivity: 100%   Specificity: 100%
0 data points were classified as ”No classification”.

3.2.2 Confidence intervals for the classifications

To make sure that the presented data is reliable, confidence intervals for the different accuracies have been calculated. The confidence intervals have been calculated as a proportion of an infinite population, where the stochastic variable follows a binomial distribution. Since each sample consists of 800 data points, the value of p is estimated as p̂ = X/n, where X represents the correctly classified data points in each cluster and n is the total number of data points in each cluster, hence 800. Each confidence interval has then been calculated with the following formula:

$$\hat{p} \pm \frac{Z_\alpha}{n}\sqrt{\hat{p}\,(1-\hat{p})}$$

For the confidence intervals at the 95% level, $Z_\alpha$ was set to 1.96, and for the confidence intervals at the 99% level, $Z_\alpha$ was set to 2.58.

The maximum span of the confidence intervals in Table 3.13 at the 95% confidence level is 0.22 percentage points, and the maximum span at the 99% confidence level is 0.28 percentage points.

Table 3.13: Confidence intervals for the accuracies of the different classifications made in the main study, clustered in groups of Swedish and English (as presented in the confusion matrices).
Classification          Accuracy    95% Lower   95% Upper   99% Lower   99% Upper
Smodin on Swedish       73.75%      73.64%      73.86%      73.61%      73.89%
Smodin on English       94.75%      94.70%      94.80%      94.68%      94.82%
Copyleaks on Swedish    94.88%      94.82%      94.93%      94.80%      94.95%
Copyleaks on English    100.00%     100.00%     100.00%     100.00%     100.00%

The maximum span of the confidence intervals in Table 3.14 at the 95% confidence level is 0.22 percentage points, and the maximum span at the 99% confidence level is 0.30 percentage points.

Table 3.14: Confidence intervals for the accuracies of the different classifications made in the main study, clustered in groups of human-written articles and AI-created texts.
Classification          Accuracy    95% Lower   95% Upper   99% Lower   99% Upper
Smodin on Human         97.75%      97.71%      97.79%      97.70%      97.80%
Smodin on AI            70.75%      70.64%      70.86%      70.60%      70.90%
Copyleaks on Human      100.00%     100.00%     100.00%     100.00%     100.00%
Copyleaks on AI         94.88%      94.82%      94.93%      94.80%      94.95%
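As a small worked example, the interval for Smodin on Swedish in Table 3.13 can be reproduced directly from the counts in Table 3.9, where 193 + 397 = 590 of the 800 data points were correctly classified. The Python sketch below is purely illustrative (the function name is invented here and is not part of the study's tooling) and applies the formula exactly as stated above.

from math import sqrt

def confidence_interval(correct, n, z):
    # p_hat +/- (z / n) * sqrt(p_hat * (1 - p_hat)), as in section 3.2.2
    p_hat = correct / n
    half_width = (z / n) * sqrt(p_hat * (1 - p_hat))
    return p_hat - half_width, p_hat + half_width

# Smodin on Swedish: 590 of 800 data points correctly classified (Table 3.9).
low95, high95 = confidence_interval(590, 800, 1.96)
low99, high99 = confidence_interval(590, 800, 2.58)
print(f"95%: {low95:.2%} - {high95:.2%}")  # 73.64% - 73.86%
print(f"99%: {low99:.2%} - {high99:.2%}")  # 73.61% - 73.89%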
4 Discussion

4.1 Discussion

All studies undertaken in this research, that is, the studies performed on the three AI chatbots and the three AI detectors, have been performed by one person only: the author of this report. In this way, it has been ensured that there were no differences in how the six different studies were carried out. Subtle differences that can go undetected when two different persons attempt to do the same work have thereby been avoided, since only one person carried out the research.

Regarding the main study, the inputs into the two different AI detectors were made within a relatively short time span (from 2 May 2024 to 4 May 2024), mitigating the risk of the results being affected by updates to the AI detectors.

This study was conducted in an attempt to understand whether AI could do the job of distinguishing between texts created by AI and texts written by humans, with a focus on Swedish. The results from the pre-study were not conclusive, mainly because of differences in how different AI systems work. However, it is important to consider the results from the main study and to remember that they show significant findings.

4.1.1 The pre-study

The findings from the pre-study on Swedish show that none of the three studied AI chatbots, Copilot (Microsoft BING), ChatGPT 3.5, and GPT-SW3 (gpt-swe-20b-instruct), presented reliable classifications. The same was the case for the pre-study on English. This confirms what was presented by The Guardian (2023): an LLM predicts the most likely next word in a sequence of words or sentences and does not automatically understand that it should analyse the presented text.

On Swedish, Copilot (Microsoft BING) showed too low values of accuracy, sensitivity, and specificity (all around 60-65%). ChatGPT 3.5 had higher values of accuracy and specificity, but a sensitivity of 0%, since it never classified any data point as created by an AI. Lastly, GPT-SW3 (gpt-swe-20b-instruct) also showed too low values of accuracy, sensitivity, and specificity, however with a greater spread than Copilot (Microsoft BING) (spanning from 58% on specificity up to 75% on accuracy). Also worth noticing for GPT-SW3 (gpt-swe-20b-instruct) is that it produced a greater number of wrongfully classified human-written articles (11 classified as created by AI) than the total number of correctly classified data points (3 correctly classified AI-created texts and 5 correctly classified human-written articles, a total of 8).

On English, Copilot (Microsoft BING) had mediocre values of accuracy and specificity, and lacked a value for sensitivity since it never clearly classified any data point as created by an AI. ChatGPT 3.5 had higher values of accuracy and specificity, but also lacked a value for sensitivity. This is because it never clearly classified any data point as created by an AI; instead it marked those as neutral inputs. Lastly, GPT-SW3 (gpt-swe-20b-instruct) also showed too low values of accuracy, sensitivity, and specificity.

What this shows is that the pre-study did not give much hope for the potential performance of the three different AI chatbots. Regardless of whether Swedish or English was studied, the accuracies of the AI chatbots were too low, stemming from poor sensitivities as well as poor specificities. This confirms what was presented in the Introduction chapter, that ”[...] the LLM model does not understand a text as a human does, instead it is a mathematical model that predicts the most likely next word in a sequence of words or sentences.” This means that, regardless of whether the AI chatbots have the ability to analyse and classify a text, they do not possess the ability to ”understand” that they should take the part of the prompt containing the text to be analysed and classify it. The way in which the AI chatbots are structured thus hinders them from performing the task of classifying a text input. Therefore, dedicated AI detectors should be used instead.

Looking into the performance of the AI detector in the pre-study, different findings were presented for Swedish and for English. What can be concluded from looking at the confusion matrix is that GPTZero marks every Swedish input as an article written by a human. This is most probably because GPTZero is trained on English and only recognizes English words. In other words, what was presented by Ma et al.
(2023) could now be used to state that every word in Swedish would show high perplexity, since every word in every input is a word that GPTZero has not seen before, leading to every input being classified as written by a human. However, when examining the confusion matrix for the English inputs, GPTZero correctly classifies every input, presenting a better result than what Habibzadeh (2023) presented. This shows that the AI is capable of performing the work of distinguishing between human-made and AI-created texts in English, but it fails to do so on Swedish.

The difference in performance of the AI detector on Swedish and English was an interesting development. This led to the reasoning that, for an AI detector to be able to detect AI content in a certain language, the AI detector must also be competent in that language. When looking into the AI detector used in the pre-study, it therefore seemed like it could not handle Swedish. Worth noticing too is that, since the AI detector could correctly classify all English texts whilst classifying all Swedish texts as written by humans, it seemed like the AI detector was only looking for AI content. In other words, when the AI detector encountered something it did not recognise as being created by AI, it classified it as written by a human. This should also explain why the AI detector could correctly classify the texts in English whilst failing on Swedish. Since the AI detector could not recognize Swedish texts created by an AI, it simply marked them as written by humans.

With all this presented, it became clear that there was no need to study the AI chatbots further in this study, because of their inability to detect AI content. On the other hand, it was obvious that, to be able to study AI detectors properly, the AI detectors must have the ability to detect the Swedish language. Therefore, the main study was performed on two AI detectors that claimed to be able to detect both Swedish and English.

4.1.2 The main study

The main study focused on articles written by humans and texts created by AI; more specifically, 400 articles written by humans in Swedish, 400 articles written by humans in English, 400 texts created by AI in Swedish, and 400 texts created by AI in English.

To start with, the performance of the Smodin AI detector was better than that of GPTZero on Swedish. However, the performance of the Smodin AI detector was worse than that of GPTZero on English. This difference in performance on the two languages can be perceived as the result of two different phenomena. The first concerns the ability of Smodin to detect Swedish, which was the main problem for GPTZero. The better performance of the Smodin AI detector should mainly be considered a result of the actual ability of Smodin to detect Swedish content, something that GPTZero was incapable of. The second concerns the increased number of data points in the main study (800) compared to the pre-study (32), and how that affects the performance of Smodin on English. The slightly lower performance of Smodin, a few percentage points below GPTZero, can reasonably be regarded as a result of a bigger and more diverse data set. Also worth mentioning is that, as presented by Elkhatat et al.
(2023), development in AI models could make them harder to detect, and since the pre-study was carried out using ChatGPT 3.5 while the main study used ChatGPT 4, this slightly lower performance of Smodin might be due to the increased complexity of ChatGPT 4.

When looking into the performance of the Smodin AI detector on Swedish, it is clear that the accuracy of 74% is mainly a consequence of wrongful predictions on the actual AI-created texts. In other words, the low accuracy is mainly derived from a low sensitivity, since about half of the AI-created texts (195 of 400) were classified as human-written articles (false negatives).

Taking a look at the Copyleaks AI detector on Swedish, it shows a specificity of 100%, as it never wrongfully classifies any of the articles written by humans. However, every tenth text created by AI was classified as written by a human, which is reflected in a sensitivity of 90%, resulting in an overall accuracy of 95%.

Moving on to the English part of the main study, the performance of the Smodin AI detector is good, with high values of accuracy, sensitivity, and specificity. However, the Smodin AI detector is beaten by the Copyleaks AI detector on English, since the latter shows a constant 100% performance on all three of accuracy, sensitivity, and specificity.

As was presented in section 3.2.2 Confidence intervals for the classifications, the span of the confidence intervals at the 99% level was at most 0.30 percentage points. This indicates that the accuracies of the different AI detectors should be regarded as very precise.

Moving forward, the Smodin AI detector shows good values on Swedish, and even better values on English. Nonetheless, even though it performs better than GPTZero, it produces a higher number of wrongfully classified texts written by humans. This means that, when using Smodin as detector, some texts written by humans will actually be classified as created by AI. This is not a desirable state, since in a school setting it could be devastating to wrongfully accuse a pupil or a student of having used AI to create a text that they actually wrote themselves. One has to be extra cautious when testing pupils and students for academic dishonesty (Reuters, 2023), and even more cautious when it comes to higher levels of academic writing (Habibzadeh, 2023). In other words, for the purpose of this study, the aim is to find a solution that produces no false positive classifications. Therefore, Smodin is not perceived as a reliable option.

When looking at the Copyleaks AI detector, that AI detector performs better. Not only does it produce no false positive classifications on either English or Swedish, it also shows the best values of accuracy, sensitivity, and specificity. On English, there is not much to say: the performance of Copyleaks is 100% accurate. When looking at the performance of Copyleaks on Swedish, it is important to notice that every tenth text created by AI is wrongfully classified as written by a human. Still, since there are no wrongfully classified articles written by humans, the fact that nine out of ten texts created by an AI are correctly classified as created by an AI is a good performance. This shows that it is possible today for AI to distinguish between texts created by AI and articles written by humans.
Worth mentioning is also the fact that Copyleaks performs with a 100% accuracy (at the 99% confidence level) on all articles written by humans. This means that, in this research, not a single article written by a human was wrongfully classified as being created by an AI. In other words, the Copyleaks AI detector does not wrongfully classify any article written by a human, regardless of whether it is written in English or in Swedish.

5 Conclusion

5.1 Conclusion

With all this stated, it is safe to say that AI can do the job of distinguishing between texts created by AI and texts written by humans. However, it must be emphasised that it is not possible to ask an AI chatbot whether a text is created by an AI or written by a human and expect a reliable answer. What is possible, on the other hand, is for AI detectors to distinguish between texts created by AI and texts written by humans. On English, the best performing AI detector showed an accuracy of 100%, and on Swedish, the best performing AI detector showed an accuracy of just under 95%. The performance on English is clearly perfect, and whilst the performance on Swedish is lower, it is still good.

What can be said is that the specificity for the best performing AI detector on Swedish was 100%, meaning that no texts written by humans were wrongfully classified as created by AI. However, the sensitivity of 90% for the same AI detector confirms that every tenth text created by an AI will wrongfully be classified as written by a human. Thus, one could claim that AI can do the job of distinguishing between texts created by AI and texts written by humans in settings where it is important that texts written by humans are not wrongfully classified, but where some wrongful classification of texts created by AI is accepted. This could be a setting like the Swedish education system, where the results from an AI detection tool could be used for disciplinary actions.

The conclusion is thereby that, if you want an AI detector to detect Swedish texts created by AI, accept that every tenth of those will be wrongfully classified as written by a human, and also do not want texts written by humans to be wrongfully classified, then today an AI can do the job for you. With that said, since AI is constantly evolving and becoming more refined, one should still be careful about what results to expect for texts created by AI at a later date than today.

5.2 Limitations

This study has only been made with texts created by one AI, namely ChatGPT (version 3.5 in the pre-study and version 4 in the main study). This study can therefore not be used to say anything about the performance of detecting texts created by any other AI. Also, as this study is presented, OpenAI has just released GPT-4o, which OpenAI (2024) themselves describe as ”[...] our new flagship model that can reason across audio, vision, and text in real time.” There is nothing in this study that can confirm whether texts created by GPT-4o could be detected by an AI detector; therefore one should be cautious when wanting to detect texts created by AI in the near future.

It is also worth mentioning that the AI detectors studied in the main study might well improve, as well as decline, in performance as the development of generative AI proceeds. With that said, it is important to remember that the performance of the AI detectors studied in this research might be totally different tomorrow.

Another thing to be cautious about is that the study has been focused on articles.
The texts written by humans in this study were articles of different kinds, and the texts created by AI (ChatGPT 4) were explicitly prompted to look like articles. It is therefore difficult to draw any firm conclusions about what these findings would imply in a school setting, since the texts written in schools differ from what a newspaper article looks like.

Finally, it is important to mention that this study has only considered texts purely written by humans and texts purely created by AI. This means that the conclusions regarding the findings in this study are only applicable to texts of these kinds. If a text were created by an AI and thereafter manipulated by a human, or, say, if a text were written 50/50 by a human and an AI, then this study cannot be used to conclude anything about the ability of an AI detector to correctly classify those kinds of texts.

5.3 Ethics

In the setting of classifying something created by a human, wrongfully classifying a text written by a human as a text created by an AI could have negative consequences for the individual. If the classification is to be the foundation for a disciplinary action, such as an academic suspension, one must be sure that the basis for the decision is correct. Hence, one should be cautious and careful about the use of these kinds of AI detectors when using them for, for example, disciplinary reasons.

From another perspective, it is also important to remember that what has been said about Smodin and Copyleaks in this research is only applicable to those AI detectors. That means that, when using those AI detectors after they have been updated, or when using another AI detector, one should not be too confident in relying on their classifications.

It is also important to remember that some people, through their personal way of writing, will write in a more AI-like manner than others. This means that, if a human writes texts similarly to how an AI would have written them, this alone could be the reason why a text written by that person is classified as created by an AI instead of written by a human.

5.4 Further studies

As was presented in the previous section, this study has considered texts purely written by humans and texts purely created by AI. Therefore, it would be interesting to see further studies on texts partially created by AI and partially written by humans. Also, it would be interesting to see extensive research on the ability of AI detectors to detect content created by AI in the different Nordic languages. It would be interesting to understand whether there are subtle similarities or differences between the languages that make an AI detector trained on, for example, Swedish also able to detect AI content in, for example, Icelandic.

”It's murder on the dancefloor.” hummed Oliver Quick as he glanced over to Mr. Ripley.

Bibliography

Ayoobi, N., Shahriar, S., & Mukherjee, A. (2023). The looming threat of fake and LLM-generated LinkedIn profiles: Challenges and opportunities for detection and prevention. Proceedings of the 34th ACM Conference on Hypertext and Social Media. https://doi.org/10.1145/3603163.3609064

Beyond Identity. (2023). The future of hacking is AI. Retrieved September 24, 2024, from https://www.beyondidentity.com/reports-guides/the-future-of-hacking-is-ai

Bisi, T., Risser, A., Clavert, P., Migaud, H., & Dartus, J. (2023).
What is the rate of text generated by artificial intelligence over a year of publication in Orthopaedics & Traumatology: Surgery & Research? Analysis of 425 articles before versus after the launch of ChatGPT in November 2022. Orthopaedics & Traumatology: Surgery & Research, 109(8), 103694. https://doi.org/10.1016/j.otsr.2023.103694

Bushard, B. (2023). Fake scientific abstracts written by ChatGPT fooled scientists, study finds. Retrieved September 24, 2024, from https://www.forbes.com/sites/brianbushard/2023/01/10/fake-scientific-abstracts-written-by-chatgpt-fooled-scientists-study-finds/

Chaka, C. (2023). Detecting AI content in responses generated by ChatGPT, YouChat, and Chatsonic: The case of five AI content detection tools. Journal of Applied Learning and Teaching, 6(2).

Cingillioglu, I. (2023). Detecting AI-generated essays: The ChatGPT challenge. The International Journal of Information and Learning Technology, 40(3), 259-268.

Dahlgren, P. M. (n.d.). News articles and front pages from 19 Swedish news sites during the COVID-19/corona pandemic 2020-2021. https://doi.org/10.5878/d18f-q220

Elkhatat, A. M., Elsaid, K., & Almeer, S. (2023). Evaluating the efficacy of AI content detection tools in differentiating between human and AI-generated text. International Journal for Educational Integrity, 19(1), 17.

Habibzadeh, F. (2023). GPTZero performance in identifying artificial intelligence-generated medical texts: A preliminary study. Journal of Korean Medical Science, 38(38).

Hitsuwari, J., Ueda, Y., Yun, W., & Nomura, M. (2023). Does human-AI collaboration lead to more creative art? Aesthetic evaluation of human-made and AI-generated haiku poetry. Computers in Human Behavior, 139, 107502.

Jiang, B., Tan, Z., Nirmal, A., & Liu, H. (2024). Disinformation detection: An evolving challenge in the age of LLMs. In Proceedings of the 2024 SIAM International Conference on Data Mining (SDM) (pp. 427-435). https://doi.org/10.1137/1.9781611978032.50

King, T. C., Aggarwal, N., Taddeo, M., & Floridi, L. (2020). Artificial intelligence crime: An interdisciplinary analysis of foreseeable threats and solutions. Science and Engineering Ethics, 26, 89-120.

Kirchenbauer, J., Geiping, J., Wen, Y., Katz, J., Miers, I., & Goldstein, T. (2023). A watermark for large language models. International Conference on Machine Learning, 17061-17084.

Kirchenbauer, J., Geiping, J., Wen, Y., Shu, M., Saifullah, K., Kong, K., Fernando, K., Saha, A., Goldblum, M., & Goldstein, T. (2023). On the reliability of watermarks for large language models. arXiv preprint arXiv:2306.04634.

Liang, W., Izzo, Z., Zhang, Y., Lepp, H., Cao, H., Zhao, X., Chen, L., Ye, H., Liu, S., Huang, Z., et al. (2024).
Monitoring AI-modified content at scale: A case study on the impact of ChatGPT on AI conference peer reviews. arXiv preprint arXiv:2403.07183.

Ma, Y., Liu, J., Yi, F., Cheng, Q., Huang, Y., Lu, W., & Liu, X. (2023). AI vs. human: Differentiation analysis of scientific content generation. arXiv, 2301.

Mitra, A., Mohanty, S. P., & Kougianos, E. (2024). The world of generative AI: Deepfakes and large language models. arXiv preprint arXiv:2402.04373.

OpenAI. (2024). Hello GPT-4o. Retrieved May 20, 2024, from https://openai.com/index/hello-gpt-4o/

Partadiredja, R. A., Serrano, C. E., & Ljubenkov, D. (2020). AI or human: The socio-ethical implications of AI-generated media content. 2020 13th CMI Conference on Cybersecurity and Privacy (CMI) - Digital Transformation - Potentials and Challenges (51275), 1-6.

Reuters. (2023). ChatGPT sets record for fastest-growing user base - analyst note. Retrieved May 16, 2024, from https://www.reuters.com/technology/chatgpt-sets-record-fastest-growing-user-base-analyst-note-2023-02-01/

Skolverket. (2023). AI i skolan - risker och möjligheter. Retrieved May 15, 2024, from https://www.skolverket.se/om-oss/aktuellt/nyheter/nyheter/2023-09-05-ai-i-skolan---risker-och-mojligheter

Stephen M. Walker II. (n.d.). Perplexity in AI and NLP. Retrieved May 15, 2024, from https://klu.ai/glossary/perplexity

Sun, Y., He, J., Cui, L., Lei, S., & Lu, C.-T. (2024). Exploring the deceptive power of LLM-generated fake news: A study of real-world detection challenges. arXiv preprint arXiv:2403.18249.

SVT Nyheter. (2020). Från chockgranater och laxkanoner till coronakris - här är nyhetsåret 2020. Retrieved February 22, 2024, from https://www.svt.se/nyheter/lokalt/varmland/fran-chockgranater-och-laxkanoner-till-coronakris-har-ar-nyhetsaret-2020

SVT Nyheter. (2023a). Lärare riktar kritik mot Skolverket, ser risk för AI-fusk: nationella proven borde stoppats. Retrieved May 15, 2024, from https://www.svt.se/nyheter/lokalt/vastmanland/larare-riktar-kritik-mot-skolverket-risk-for-ai-fusk

SVT Nyheter. (2023b). Ny rapport: Många elever använder AI för att fuska i skolan. Retrieved May 15, 2024, from https://www.svt.se/nyheter/inrikes/ny-rapport-manga-elever-anvander-ai-for-att-fuska-i-skolan

The Guardian. (2023).
How AI chatbots like ChatGPT or Bard work - visual explainer. Retrieved May 16, 2024, from https://www.theguardian.com/technology/ng-interactive/2023/nov/01/how-ai-chatbots-like-chatgpt-or-bard-work-visual-explainer

Thompson, A. (n.d.). All the news 2.0 - 2.7 million news articles and essays from 27 American publications. https://components.one/datasets/all-the-news-2-news-articles-dataset

Weichert, J., & Dimobi, C. (2024). DUPE: Detection undermining via prompt engineering for deepfake text. arXiv preprint arXiv:2404.11408.

A The Swedish pre-study

In this appendix, the research from the so-called Swedish pre-study is presented. On page II, the part depicting the articles written by humans at SVT Nyheter starts. On page VI, the part depicting the texts created by AI through ChatGPT 3.5 starts.

The articles written by humans at SVT Nyheter are presented with the following information:
• The order of the article in the sequence.
• The index for the article (H1-H16), where H stands for Human.
• The date when the article was published.
• The latest date when the article was updated.
• The date when the article was retrieved.
• The source of the article.

The texts created by AI through ChatGPT 3.5 are presented with the following information:
• The order of the text in the sequence.
• The index for the text (A1-A16), where A stands for Artificial Intelligence.
• The index of the article written by a human (H1-H16) that the text is pooled to.
• The date when the AI was prompted and the text output was retrieved.
• The prompt that was put into the AI.
• The actual text.

Lastly, on page XVIII, the order of the 32 data points in the collated data set is presented, and on page XIX, the classifications made by the different AI algorithms on the 32 data points are presented.

Below are the articles written by humans at SVT Nyheter, presented in plain text as they were put into the AI algorithms.
H1: First article written by a human
Index: H1   Published: 2020-08-18   Updated: 2020-08-18   Retrieved: 2024-02-22
Source: https://www.svt.se/nyheter/lokalt/varmland/har-sprangs-misstankta-granaten-i-karlstad

H2: Second article written by a human
Index: H2   Published: 2020-09-05   Updated: 2020-09-07   Retrieved: 2024-02-22
Source: https://www.svt.se/nyheter/lokalt/varmland/misstankt-farligt-foremal-vid-polishuset-i-kristinehamn

H3: Third article written by a human
Index: H3   Published: 2020-11-05   Updated: 2020-11-20   Retrieved: 2024-02-22
Source: https://www.svt.se/nyheter/lokalt/varmland/personalbrist-pa-forlossningen-trots-standiga-rekryteringsforsok

H4: Fourth article written by a human
Index: H4   Published: 2020-08-14   Updated: 2020-08-19   Retrieved: 2024-02-22
Source: https://www.svt.se/nyheter/lokalt/varmland/kalle-91-jag-flyttar-inte-da-far-de-kora-mig-harifran-i-en-likkista

H5: Fifth article written by a human
Index: H5   Published: 2020-09-24   Updated: 2020-09-24   Retrieved: 2024-02-22
Source: https://www.svt.se/nyheter/lokalt/varmland/granbarkborren-orsakade-skador-for-over-sju-miljarder
H6: Sixth article written by a human
Index: H6   Published: 2020-09-07   Updated: 2020-09-08   Retrieved: 2024-02-22
Source: https://www.svt.se/nyheter/lokalt/varmland/carolas-son-ett-av-narkotikans-offer

H7: Seventh article written by a human
Index: H7   Published: 2020-12-15   Updated: 2024-02-12   Retrieved: 2024-02-22
Source: https://www.svt.se/nyheter/lokalt/varmland/klart-svenska-rallyt-stalls-in

H8: Eighth article written by a human
Index: H8   Published: 2020-10-06   Updated: 2020-10-06   Retrieved: 2024-02-22
Source: https://www.svt.se/nyheter/lokalt/varmland/sveriges-forsta-laxkanon-i-drift-i-forshaga

H9: Ninth article written by a human
Index: H9   Published: 2020-03-15   Updated: 2020-03-15   Retrieved: 2024-02-22
Source: https://www.svt.se/nyheter/lokalt/varmland/smittskyddslakaren-viktigt-att-begransa-hur-vi-umgas

H10: Tenth article written by a human
Index: H10   Published: 2020-04-21   Updated: 2020-04-22   Retrieved: 2024-02-22
Source: https://www.svt.se/nyheter/lokalt/varmland/storforspolitiker-akte-till-stockholm-for-att-fa-corona

H11: Eleventh article written by a human
Index: H11   Published: 2020-03-16   Updated: 2020-03-23   Retrieved: 2024-02-22
Source: https://www.svt.se/nyheter/lokalt/varmland/det-ar-nastan-varre-an-under-kriget

H12: Twelfth article written by a human
Index: H12   Published: 2020-04-08   Updated: 2020-04-08   Retrieved: 2024-02-22
Source: https://www.svt.se/nyheter/lokalt/varmland/inga-norrman-i-varmland-i-pask-tomt-i-stugor-och-husvagnar

H13: Thirteenth article written by a human
Index: H13   Published: 2020-04-15   Updated: 2020-04-15   Retrieved: 2024-02-22
Source: https://www.svt.se/nyheter/lokalt/varmland/mot-personalen-som-vardar-coronapatienter

H14: Fourteenth article written by a human
Index: H14   Published: 2020-09-15   Updated: 2020-09-15   Retrieved: 2024-02-22
Source: https://www.svt.se/nyheter/lokalt/varmland/isoleringen-borjar-kannas-trostlos-for-vara-aldre

H15: Fifteenth article written by a human
Index: H15   Published: 2020-06-05   Updated: 2020-06-05   Retrieved: 2024-02-22
Source: https://www.svt.se/nyheter/lokalt/varmland/lang-vag-till-aterhamtning-for-johan-efter-covid-19

H16: Sixteenth article written by a human
Index: H16   Published: 2020-12-06   Updated: 2020-12-07   Retrieved: 2024-02-22
Source: https://www.svt.se/nyheter/lokalt/varmland/efter-avgorande-matchen-gladje-lattnad-det-ar-allting

Below are the texts created by AI through ChatGPT 3.5, presented in plain text as they were put into the AI algorithms.

A1: First text created by an AI
Index: A1   Pooled to: H1   Retrieved: 2024-02-23
Prompt: Skriv en kort nyhetsartikel med rubriken ”Här sprängs misstänkta granaten i Karlstad”.

Här sprängs misstänkta granaten i Karlstad

Under tidiga morgontimmar idag inträffade en kraftig explosion i centrala Karlstad, vilket sände chockvågor genom det lugna samhället. Den misstänkta explosionen, som polisen bekräftar vara relaterad till en granat, skedde vid en park i närheten av ett bostadsområde.

Polisen uppger att ingen skadades i explosionen, men skador på omkringliggande egendom har noterats. Räddningstjänsten och polisen arbetar nu på platsen för att säkra området och utreda händelsen.

En talesperson för polisen uppmanar allmänheten att vara försiktig och rapportera alla misstänkta föremål till myndigheterna omedelbart. Det finns ännu inga uppgifter om motivet bakom explosionen, men en utredning är nu igång för att fastställa omständigheterna kring händelsen.
Invånarna i Karlstad uppmanas att vara vaksamma och samarbeta med myndigheterna för att säkerställa tryggheten i staden. Polisen kommer att fortsätta sin närvaro i området under dagen medan utredningen pågår.

A2: Second text created by an AI
Index: A2   Pooled to: H2   Retrieved: 2024-02-23
Prompt: Skriv en kort nyhetsartikel med rubriken ”Misstänkt farligt föremål vid polishuset i Kristinehamn”.

Misstänkt farligt föremål vid polishuset i Kristinehamn

Under tisdagsmorgonen upptäcktes ett misstänkt farligt föremål utanför polishuset i Kristinehamn. Polisen har inlett en utredning och avspärrningar har satts upp runt området för att säkerställa allmänhetens säkerhet.

Enligt initiala rapporter ska föremålet ha påträffats av en polispatrull som genast vidtog säkerhetsåtgärder och larmade specialister för att undersöka det misstänkta föremålet närmare.

Detaljer kring föremålets karaktär eller eventuellt motiv är ännu inte tillgängliga då polisen fortsätter med utredningen. Invånare uppmanas att undvika området kring polishuset tills situationen är under kontroll.

Polisen betonar vikten av allmänhetens samarbete och uppmanar personer som har information om händelsen att kontakta dem omedelbart.

Fler uppdateringar förväntas ges allteftersom utredningen fortskrider och situationen utvecklas.

A3: Third text created by an AI
Index: A3   Pooled to: H3   Retrieved: 2024-02-23
Prompt: Skriv en kort nyhetsartikel med rubriken ”Personalbrist på förlossningen trots ständiga rekryteringsförsök”.

Personalbrist på förlossningen trots ständiga rekryteringsförsök

Trots ihärdiga rekryteringsinsatser kämpar förlossningsavdelningar runt om i landet med en akut brist på personal. Den oroande situationen har fått personalstyrkan att nå kritiska nivåer, vilket innebär ökad arbetsbelastning för befintlig personal och försämrade vårdstandarder för förlossande kvinnor.

Enligt ledningen för flera förlossningsavdelningar har rekryteringsprocesserna varit kontinuerliga och omfattande. Trots detta har det varit svårt att fylla vakanserna. En av anledningarna till den fortsatta bristen på personal tros vara den höga arbetsbelastningen och de stressfyllda arbetsförhållandena som förknippas med yrket.

”Vi har gjort allt vi kan för att locka till oss kvalificerad personal, inklusive annonsering, erbjudanden om bonusar och utbildningsincitament. Tyvärr har vi ännu inte kunnat täcka behoven”, säger Anna Andersson, sjukhuschef vid en av förlossningsavdelningarna.

Konsekve