Shape the Future of Swedish Healthcare with AI-Technology
How to Implement Large Language Models as a Tool to Streamline Clinicians' Administrative Tasks

Master's Thesis in Management and Economics of Innovation, MSc
AMANDA NACKOVSKA
ELIN BERTHAG

© AMANDA NACKOVSKA, 2024
© ELIN BERTHAG, 2024

Department of Technology Management and Economics
Division of Innovation and R&D Management
Chalmers University of Technology
SE-412 96 Gothenburg, Sweden
Telephone +46 (0)31-772 1000
www.chalmers.se
Gothenburg, Sweden 2024

ABSTRACT

The demand for healthcare is continuously increasing in line with an ageing population. It is a recognized problem that clinicians have a high administrative workload as a consequence of digitalization, which takes time away from valuable direct patient care. Clinicians perform multiple text-based administrative tasks, and it can be argued that large language models (LLMs) have the potential to streamline these tasks. In recent years, LLMs have received great attention, led by the public introduction of ChatGPT by OpenAI in November 2022. Thus, the purpose of this study is to explore how LLMs can be used to relieve the administrative text-based workload for clinicians at Sahlgrenska University Hospital. The study is delimited to patient-related administrative text-based tasks performed by physicians and nurses at the neurology, ophthalmology and radiology departments at Sahlgrenska University Hospital. The study rests on a broad empirical foundation of 46 semi-structured interviews: 37 conducted with healthcare professionals at the three departments and 9 conducted with 10 experts within the field of AI in healthcare. In addition, data have been collected by distributing self-completion forms at the hospital to measure the time clinicians spend on certain administrative text-based tasks, and through field notes from a number of observations at the hospital. The obtained data have been analyzed through thematic analysis. The results of the study show that there is vast potential to use LLMs to streamline patient-related administrative text-based tasks in healthcare. However, there are boundaries that need to be addressed. Technological concerns have been identified due to the novelty of the technology. Ethical concerns have been identified, mainly the risk that LLMs generate biased and incorrect information, and that the information cannot be validated. The three practical cases of this study clearly show that there is a need to streamline the clinicians' patient-related administrative text-based tasks.
While it can be concluded that there is potential to use LLMs for this purpose, it should be noted that this needs to be further researched in a practical setting. This research further concludes that there are clear differences in the clinicians' needs across the three departments, which adds complexity to the process of prioritizing use cases to put into practice at the hospital.

Keywords: large language models, artificial intelligence, healthcare, administration, text-based tasks, patient-related.

Acknowledgements

This master's thesis was conducted during the spring semester of 2024 as part of our studies in Management and Economics of Innovation at Chalmers University of Technology. The research study was performed in collaboration with several key people to whom we would like to express our gratitude.

First, we would like to extend our heartfelt thanks to Andreas Hellström, our supervisor and examiner. Andreas has been an incredible source of guidance and support throughout this project. His positive attitude and willingness to share his expertise have been invaluable. We truly appreciate all his help and encouragement along the way.

Second, we would like to extend our gratitude to our clients at Sahlgrenska University Hospital: Erik Thurin, Maria Hansen and Robin Molander. Their leadership in initiating this project, coupled with their ongoing sharing of expertise, engagement, and passion for improving Swedish healthcare, has been integral to the success of this project. Without you, this project would not have been possible.

Third, we would like to extend our gratitude to our collaborating master's thesis project partners, Albert Lund and Felix Nilsson, as well as the talented team at Kompetenscentrum AI at Sahlgrenska University Hospital, led by Isak Barbopoulos. We deeply appreciate the invaluable teamwork and knowledge sharing between our two groups, which has been incredibly beneficial to our project's progress and results.

Finally, we would like to sincerely thank the research participants, both all participating healthcare professionals from Sahlgrenska University Hospital and all participating experts within the field of AI in healthcare. Without the contribution of each and every one of you, this project would not have been possible.

Amanda Nackovska
Elin Berthag
Gothenburg, June 2024

List of Acronyms

Below is the list of acronyms that have been used throughout this thesis, listed in alphabetical order:

AI: Artificial Intelligence
EHR: Electronic Health Records
IT: Information Technology
LLMs: Large Language Models
SU: Sahlgrenska University Hospital

List of Concepts

Below is the list of concepts that have been used throughout this thesis, listed in thematic order:

Medical fields
Neurology: Focused on diagnosing and treating disorders of the nervous system, including the brain, spinal cord, and peripheral nerves.
Ophthalmology: Dedicated to the diagnosis, treatment, and prevention of eye and vision disorders.
Radiology: Specializing in imaging techniques to diagnose and treat diseases.

Healthcare task types
Non-patient-related: Tasks that do not involve direct patient interaction, such as support services.
Patient-related: Tasks that involve direct interaction with and care for patients.

Patient care types
In-patient care: Medical care provided to patients admitted to a hospital or clinic for an overnight stay or longer.
Out-patient care: Medical care provided to patients who visit a healthcare facility for diagnosis or treatment without staying overnight.
Professional healthcare titles
Clinician: A general term for a healthcare professional involved in patient care.
Medical secretary: Supports medical staff by managing appointments, maintaining patient records, and performing other administrative duties.
Nurse: Provides patient care, administers medications, performs medical procedures, and educates patients and their families.
Nursing assistant: Supports nurses by assisting with basic patient care tasks such as bathing, dressing, and feeding patients, and measuring vital signs.
Physician: Has completed medical school and residency training. Diagnoses and treats illnesses, prescribes medications, and may perform surgeries.
Physiotherapist: Helps patients improve their physical movement and manage pain through exercise, manual therapy, and other treatments.
Resident: A physician in training who has completed medical school and is undergoing specialized training in a particular field under the supervision of senior physicians.
Senior specialist: A physician who has completed advanced training and has extensive expertise in a specific medical field, often overseeing patient care and guiding junior physicians.
Surgeon: A physician who specializes in performing surgical procedures to treat diseases, injuries, or deformities, requiring extensive training in specific surgical techniques.

Different types of notes physicians write in the health records

Out-patient care
Clinical note/visit note: Documentation of new or follow-up visits, including symptoms, examinations and treatments.

In-patient care
Admission note: Summary of the patient's condition upon admission to the hospital, including medical history, examinations, and planned care.
Daily progress note: Daily documentation of the patient's condition, treatments, and progress during their hospital stay.
Discharge letter: Summary of a patient's care and treatment during their hospital stay, providing instructions for continued care and recovery following discharge.
Discharge summary: Summary of the patient's hospital stay, including diagnosis, treatments, and recommendations for continued care after discharge.
Ward round notes: Documentation of the patient's condition, treatment plan, and significant observations made each day of a patient's hospital stay.

Other notes
Consultation note: Documentation of a specialist's assessment and recommendations after consulting another healthcare provider regarding a patient.
Operative report: Detailed description of a surgical procedure, including the intervention, any complications, and postoperative care.
Referral: Document sent from one healthcare provider to another for the transfer of a patient to a specialist or for further investigation and care.

Different types of notes nurses write in the health records

Out-patient care
Clinical notes/visit notes: Documentation of new or follow-up visits, including symptoms, examinations and treatments.

In-patient care
Admission note: Summary of the patient's condition upon admission to the hospital, including medical history, examinations, and planned care.
Discharge planning: Coordination of the patient's transition from the hospital to home or another care setting, including arrangements for follow-up care, medication management, and patient education.
Nursing assessment: Documentation of a patient's current health status, including vital signs, physical assessment findings, and any nursing interventions required.
Nursing discharge note: Documentation of the patient's condition at the time of discharge, including final assessments, instructions for ongoing care, and any referrals or follow-up appointments required.
Risk assessment note: Evaluation of potential risks to the patient's health and safety, including assessment of fall risk, pressure ulcer risk, and risk of developing complications.
Ward round notes: Documentation of the patient's assessments, interventions, and responses to treatment made each day of a patient's hospital stay.

Other notes
Postoperative care note: Documentation of the patient's recovery and ongoing care following surgery, including assessments of wound healing, pain management, and monitoring for potential complications.
Referral for cancer rehabilitation: Request to another healthcare provider or specialist for a patient to receive specialized cancer rehabilitation services.
Referral for sutures: A request to another healthcare provider or specialist to have a wound closed with stitches (sutures).

Contents

List of Acronyms
List of Concepts
List of Figures
List of Tables
1 Introduction
  1.1 Background
  1.2 Purpose
  1.3 Research Questions
  1.4 Delimitations
2 Theoretical Framework
  2.1 The Development of Artificial Intelligence and its Role in the Healthcare Sector
  2.2 Large Language Models
    2.2.1 The Technology Behind LLMs
    2.2.2 Training LLMs
      2.2.2.1 Fine Tuning
      2.2.2.2 Retrieval-Augmented Generation
      2.2.2.3 Scaling Laws
    2.2.3 Challenges of LLMs
    2.2.4 The Future of LLMs
  2.3 Ambient AI Scribes
  2.4 LLMs in Healthcare
  2.5 Ethics and Legislation of LLMs in Healthcare
    2.5.1 Medical Device
  2.6 Healthcare in Sweden
    2.6.1 Definition of Administrative versus Clinical Work in Healthcare Settings
    2.6.2 Previous Studies on Time Utilization in Healthcare Settings
    2.6.3 Perspectives on the Administrative Burden
    2.6.4 Illegitimate Tasks: Conceptualization and Implications for Employees in the Healthcare Sector
      2.6.4.1 Illegitimate Tasks Impact on Well-Being
3 Methods
  3.1 Research Process
  3.2 Data Collection
    3.2.1 Literature Review
    3.2.2 Qualitative Interview
    3.2.3 Sampling and Practicalities
    3.2.4 List of Interviewees
    3.2.5 Diary as a Self-Completion Questionnaire
      3.2.5.1 Sampling for the Diary as a Self-Completion Questionnaire
    3.2.6 Observations for Understanding and Learning
  3.3 Data Analysis
  3.4 Research Quality
    3.4.1 Ethical Considerations
4 Results
  4.1 Expert Perspectives Within the Area of AI in Healthcare
    4.1.1 Potential LLM Applications in Healthcare
    4.1.2 Exploring the Potential of LLMs at Region Halland
    4.1.3 Risks, Benefits and Future Directions
    4.1.4 The Challenges of Validating LLMs
    4.1.5 Ethics and Legislation
      4.1.5.1 Machines versus Clinicians
      4.1.5.2 Who Takes Responsibility for AI Generated Outputs?
      4.1.5.3 Regulations
    4.1.6 Implementation
      4.1.6.1 A Successful Case of Implementing AI Tools in Healthcare in Canada
      4.1.6.2 Key Factors to Implement New Technology in Healthcare
  4.2 The Neurology Department
    4.2.1 The Needs of Physicians at the Neurology Department
      4.2.1.1 Everyday Text Based Tasks Performed by Neurologists
      4.2.1.2 The Perspective of the Neurology Physicians on Their Text Based Tasks
        4.2.1.2.1 Time-Consuming and Troublesome Tasks
        4.2.1.2.2 Tools and Characteristics
      4.2.1.3 Time Estimates for a Number of Text Based Tasks
    4.2.2 Self-Completion Form at the Neurology Department
    4.2.3 The Needs of Nurses at the Neurology Department
      4.2.3.1 Everyday Text Based Tasks Performed by Neurology Nurses
      4.2.3.2 The Perspective of the Neurology Nurses on Their Text Based Tasks
        4.2.3.2.1 Time-Consuming and Troublesome Tasks
        4.2.3.2.2 Tools and Characteristics
      4.2.3.3 Time Estimates for a Number of Text Based Tasks
  4.3 The Ophthalmology Department
    4.3.1 The Role of Physicians at the Ophthalmology Department
    4.3.2 The Physicians' View on the Administrative Burden at the Ophthalmology Department
    4.3.3 Task Shifting with Speech Recognition
    4.3.4 The Ophthalmology Physicians' Perspectives on Patient-Related Administration
    4.3.5 The Ophthalmology Physicians' Perspectives on Their Computer Systems
  4.4 The Radiology Department
    4.4.1 The Role of Nurses and Physicians at the Radiology Department
    4.4.2 Administrative Tasks at the Radiology Department
    4.4.3 Deep-Dive into Administration Related to Referrals
    4.4.4 Suggestions to Make the Referral Management More Efficient
    4.4.5 Other Administrative Tasks and Bottlenecks at the Radiology Department
    4.4.6 Time Spent on Administrative Tasks
5 Discussion
  5.1 The Potential and Risks of LLMs in Healthcare
  5.2 The Cases of the Three Departments
    5.2.1 The Case of the Neurology Department
    5.2.2 The Case of the Ophthalmology Department
    5.2.3 The Case of the Radiology Department
    5.2.4 Comparison Between the Three Departments
  5.3 The Impact of Administrative Work on Clinicians' Health
  5.4 The Ethical and Legal Perspective of LLMs in Healthcare
  5.5 How to Implement LLMs in Healthcare
    5.5.1 What to Learn from the Case of Unity Health Toronto
    5.5.2 What to Think About When Implementing New Technology in Healthcare
6 Conclusion
7 Recommendations
8 Future Research
Bibliography
A Appendix 1
  A.1 Interview Guide for Clinician Interviews
  A.2 Interview Guide for Expert Interviews
  A.3 Self-Completion Forms

List of Figures

3.1 Data collection, sampling and data analysis methods.
5.1 Visualization of the different needs across departments.
A.1 Self-completion form for physicians at the neurology department, second update.
A.2 Self-completion form for physicians at the neurology department, first update.
A.3 Self-completion form for nurses at the neurology department.
A.4 Self-completion form for physicians at the ophthalmology department.
A.5 Self-completion form for physicians at the radiology department.

List of Tables

2.1 Examples of direct patient work, patient-related administration, and non-patient-related administration.
3.1 List of clinician interviews.
3.2 List of expert interviews.
4.1 Estimated time physicians spend on administrative work on average on a normal day of clinical work at the neurology department.
4.2 Time estimates by neurology physicians for reading EHR and different types of writing tasks.
4.3 Data from the self-completion form at the neurology department, occasion 1.
4.4 Data from the self-completion form at the neurology department, occasion 2.
4.5 Mean values of self-completion form data.
4.6 Estimated time nurses spend on administrative work on average on a normal day of clinical work at the neurology department.
4.7 Time estimates by neurology nurses for reading EHR.
4.8 Time estimates by neurology nurses for different writing tasks.
4.9 Estimated time physicians spend on administrative work on average on a normal day of clinical work at the ophthalmology department.
4.10 Estimated time physicians spend on administrative work on average on a normal day of clinical work at the radiology department.
5.1 A selection of administrative tasks performed by radiologists and radiology nurses.

1 Introduction

This chapter provides a background for the studied issue of this paper. It further presents the purpose of this study and two research questions, as well as the delimitations of this study.

1.1 Background

Numerous studies have shown that Swedish clinicians spend a significant portion of their time on administrative tasks (Anskär, 2019; Borelius, 2009; McKinsey & Company, 2019), which detracts from the time available for essential direct patient care (Barkman & Aasa, 2023). When it comes to patient-related administration, such as reading electronic health records, pre-patient meetings, and writing medical notes such as referrals, discharge letters, and admission letters, a study from 2019 showed that physicians spend 32.3 % and nurses 27.2 % of their time on such tasks, while 34.4 % and 42.2 % of their time, respectively, is spent on direct patient care (Anskär, 2019). In addition to taking time from critical clinical work, studies have shown that unnecessary and unreasonable administrative tasks are positively correlated with burnout and negatively impact job satisfaction, which can subsequently affect the quality of healthcare (Semmer et al., 2019; Werdecker & Esch, 2021). Furthermore, the Swedish healthcare system is projected to experience increased demands due to an aging population and a shrinking recruitment base (Ahlstedt et al., 2023; Barkman & Aasa, 2023; Grant Thornton, 2023).

Given these challenges, there is a critical need to explore solutions that can alleviate the administrative burden on clinicians. Implementing AI-technologies, particularly large language models, offers a promising avenue to streamline administrative tasks and enhance operational efficiency in healthcare settings (Ilicki; DiGiorgio & Ehrenfeld, 2023; Rosenberg et al., 2024; Zhang & Kamel Boulos, 2023; Zheng et al., 2023). However, the potential is also hindered by certain risks and barriers that need to be acknowledged and addressed, especially given the sensitive nature of healthcare (Haltaufderheide & Ranisch, 2023; Harrer, 2023). Moreover, successful deployment and implementation of AI-solutions in clinical settings require coordinated planning and collaboration among healthcare professionals, IT experts, researchers, data scientists, health service administrators, and policymakers (Dhalla et al., 2023).

Sahlgrenska University Hospital is Sweden's largest hospital, where over 17 000 co-workers take care of 350 000 patients every year (Sahlgrenska Universitetssjukhuset, 2023b). With the goal of becoming Europe's leading hospital by 2032, SU has created a strategy that partly emphasizes advancements in medical developments, new technologies, and digitization. Moreover, in 2021, SU launched Kompetenscentrum AI to support the development and integration of AI in clinical practices (Sahlgrenska Universitetssjukhuset, 2023a).
In collaboration with Sahlgrenska University Hospital, Kompetenscentrum AI and Chalmers University of Technology, this paper is part of a larger project that focuses on how LLMs can be used to make the patient-related administration at Sahlgrenska more efficient. This paper focuses on identifying and addressing challenges related to the administration of physicians and nurses at three different departments: neurology, ophthalmology, and radiology.

1.2 Purpose

The purpose of this study is to explore how LLMs can be used to relieve the administrative text-based workload for clinicians at Sahlgrenska University Hospital.

1.3 Research Questions

The purpose of this study is broken down into two research questions presented below:

1. How can LLMs be used as a tool for administrative tasks in healthcare?
2. What are the boundaries to using LLMs on the administrative side of healthcare?

1.4 Delimitations

This research examines the healthcare at Sahlgrenska University Hospital in Gothenburg, Sweden. Due to available resources, data collection is limited to three different departments: neurology, ophthalmology and radiology. The study is delimited to only examine the patient-related administrative text-based work of physicians and nurses at the three mentioned departments.

2 Theoretical Framework

This chapter delves into the research perspective of large language models in healthcare. The chapter is divided into multiple sections. First, it describes the development of artificial intelligence and its role in the healthcare sector. Second, it gives a comprehensive understanding of the LLM technology, followed by a description of ambient AI scribes. Third, research on LLMs in healthcare is presented. Fourth, the ethical and legal research perspective on LLMs is described. Lastly, the topic of healthcare in Sweden and the definition of administrative work in a clinical setting is presented.

2.1 The Development of Artificial Intelligence and its Role in the Healthcare Sector

AI is a rapidly evolving domain within computer science (Alowais et al., 2023). It refers to the development of computer systems capable of performing tasks that typically require human intelligence, like reasoning, perception and decision making (Dave & Patel, 2023). The field of AI incorporates techniques such as machine learning (ML) and deep learning (DL) (Alowais et al., 2023). While AI is a broad field that includes techniques to make machines intelligent, ML is a subset of AI that involves systems which can autonomously learn. DL is further nestled within ML, utilizing deep neural network models to discern patterns with minimal human intervention.

The domain of AI has undergone significant changes over the years (Alowais et al., 2023). The term "Artificial Intelligence" was first introduced during the Dartmouth Conference organized by John McCarthy in 1956. Following this event, AI research focused on expert and rule-based systems, an approach that was limited by the need for more data and computing power. Medicine was identified as one of the most promising areas to use AI in the mid-twentieth century (Yu et al., 2018). In the 1970s, researchers developed and proposed many clinical decision support systems, and rule-based approaches were developed successfully. However, rule-based systems are costly to build and require explicit expressions of decision rules, comprehensiveness of prior medical knowledge and human-authored updates.
These systems are also limited by their difficulty to encode higher-order interactions between different pieces of knowledge that have been authored by different experts.

It was not until the 1980s and 1990s that AI research pivoted towards machine learning (ML) and neural networks (Alowais et al., 2023). This enabled machines to learn from data and improve their performance gradually. The shift carved a path for new system developments, like IBM's Deep Blue that defeated the world chess champion, Garry Kasparov, in 1997. Unlike the early rule-based AI systems, new AI systems that leveraged machine learning methods made it possible to discover patterns without the need to specify decision rules for each specific task or to account for complex interactions among input features (Yu et al., 2018). These achievements have been driven mainly by the successful application of deep learning.

AI research continued to evolve in the 2000s (Alowais et al., 2023). During that time, the research was focusing on computer vision and natural language processing (NLP). NLP is a subset of AI dedicated to training machines to interpret, comprehend and generate natural human language. This focus resulted in the development of virtual assistants capable of comprehending and reacting to natural language requests, like Apple's Siri. AI continues to evolve still, and today it is transforming large sectors in society like healthcare, transportation and finance, among other fields (Alowais et al., 2023).

Most recently, LLMs have made significant advancements in NLP (Kasneci et al., 2023). LLMs are a type of generative AI, which means that they can generate new content (Fruhlinger, 2023). LLMs are trained on extensive text datasets and can perform language-related tasks, such as generating human-like text and accurately answering questions (Kasneci et al., 2023). Transformer architectures that use a self-attention mechanism to determine the relevance of different parts of the input when generating predictions, together with pre-training on large datasets combined with fine-tuning on specific tasks, have been key to the most recent successes of LLMs.

In 2022, the impact of generative AI exploded into the public consciousness (Bergmann, 2024). The LLM-chatbot ChatGPT by OpenAI was a major source of this breakthrough (Sætra, 2023). The technology was made openly and freely available, having reached a level of maturity that made it immediately accessible and useful for multiple users. In 2023, generative AI became gradually incorporated into the business landscape (Bergmann, 2024).

The accelerating pace of change in the domain of AI makes it challenging for policy makers to promote innovation while establishing safeguards (Anderson & Sutherland, 2024). This balance is particularly challenging to manage in healthcare, where there are significant opportunities to use AI, but also considerable risks. AI can transform healthcare by helping to discover cures for diseases such as cancer, enhancing diagnostic accuracy, speed, and access, and providing tools to support clinicians. However, AI in healthcare is faced with significant risks connected to ethical concerns, algorithmic biases and data privacy and security. Close collaboration between healthcare organizations, regulatory bodies and AI researchers is crucial to establish standards and guidelines for AI algorithms and their application in clinical decision making (Alowais et al., 2023).
According to Anderson and Sutherland (2024), there are cases where the risk of using AI in healthcare is low and the predicted benefits are high. One such area is automating tasks for clinicians with AI-applications to free up more time for patient care. These represent potential areas of initial focus, and are estimated to reduce the administrative burden for clinicians by 10-30 %.

2.2 Large Language Models

As outlined above, LLMs represent a groundbreaking advancement in the field of AI (Hadi et al., 2023). LLMs are demonstrating remarkable capabilities in understanding and generating human-like language across a broad spectrum of tasks. Powered by deep learning algorithms, these models exhibit an extraordinary capacity to understand, generate, and interact with human language, based on the vast amount of data that is used to train them (Hadi et al., 2023; IBM, n.d.).

Since LLMs have the ability to generate text, they are frequently classified under the category of generative artificial intelligence (GenAI) (Raschka, 2023). Recent advancements in LLMs bolster confidence in their ability to address numerous challenges across real-world settings (Zhao et al., 2023).
It grasps diverse spatial information representations by considering the position and mean- ing of each word in the input. Moreover, the self-attention mechanism could be extended to multi-head self-attention, extracting hidden aspects of the data from every direction and angle (Liu et al., 2024). Multi-head self-attention runs multiple self-attention processes at once, with each focusing on dierent parts of the input to uncover various patterns (Ghojogh & Gh- odsi, 2020). This enables a deeper understanding of both close and distant elements in the input, enriching the model’s grasp of complex word interactions (Liu et al., 2024). 7 2. Theoretical Framework In practice, this means that transformer-based models have the capability to an- alyze the full context of a sentence or whole documents and produce text that is both coherent and contextually appropriate, transforming elds such as chatbots, text summarization, and language translation (Hadi et al., 2023; IBM, n.d.). Thus, the multi-head self-attention mechanism, encoder and decoder all play crucial roles in how well the LLM performs, tackles downstream tasks as those previously men- tioned as well as handles issues such as long-range dependencies and more (Ghojogh & Ghodsi, 2020; Liu et al., 2024; Wang et al., 2023). 2.2.2 Training LLMs LLMs are incredibly exible, meaning that one single model could perform a large variety of tasks related to text (Hadi et al., 2023). However, achieving superior performance and ensuring quality in specic application domains involve various complex and active areas of research regarding dierent techniques to enhance the models’ eciency. As previously highlighted, the introduction of Chat-GPT represented a signicant milestone of LLMs and revolutionized the AI landscape. However, ChatGPT is not available as an open-source tool and its primary usage is conned to interactions through OpenAI’s website or its API (Liu et al., 2024). Consequently, there is a growing need for the development of alternative LLMs or those tailored to specic domains to provide alternatives to ChatGPT. Choosing the right training approach is crucial for the success of a language model and training them necessitates extensive collections of textual data, with the data’s quality playing a crucial role in determining the performance of LLMs (Raschka, 2023). The process of training LLMs unfolds in three primary stages (Liu et al., 2024). Initially, it starts with the collection and preparation of data. The next step is the pre-training process, containing selection of architecture and pre-training tasks, as well as applying appropriate algorithms for parallel training to accomplish the training process. The nal step includes ne-tuning and alignment, a concept that will be further explained below. 2.2.2.1 Fine Tuning In the rise of LLMs, many organizations are hopeful about grasping the potential to make their operations more ecient (Bowman, 2023; Nabwani, 2023; Shanahan, 2023). The potential of LLMs is signicant, yet they may not always fulll spe- cic requirements directly after initial pre-training (Nabwani, 2023). This is due to several factors including that LLMs may lack tailored outputs necessary for certain applications. They might also miss crucial context, such as specic documents or instruction manuals. A LLM might not understand the context needed in clinical settings if it has never been trained on medical documents, for example (Singhal et al., 2023). 
Additionally, LLMs often struggle with specialized vocabulary unique to certain 8 2. Theoretical Framework industries or enterprises, making it dicult for them to eectively summarize or answer questions about complex topics like nancial data or medical research (Nab- wani, 2023). This is when ne-tuning comes in. Fine-tuning a LLM involves adjust- ing a pre-trained model to perform better on specic tasks or to adapt to particular types of data (Devlin et al., 2019; Rael et al., 2019). This process is essential because while a general pre-trained model can handle a broad range of topics, ne- tuning can signicantly improve its performance and accuracy in specialized areas. LLMs are statistical models that predict sentences of words based on probability dis- tribution (Ramlochan, 2023). By ne tuning these models, they are exposed to new word relationships and structural conventions specic to a target task or domain. 2.2.2.2 Retrieval-Augmented Generation As an alternative to full ne-tuning, Retrieval-Augmented Generation (RAG) serves as an innovative method to enhance LLMs by integrating prompt engineering with database querying, enabling the generation of context-rich answers (Gao et al., 2024; Nabwani, 2023). This approach not only oers a cost-eective solution to improve the relevance, and applicability of LLM outputs across diverse contexts (Gao et al., 2024; Nabwani, 2023) but also signicantly reduces the production of factually incor- rect content by leveraging external databases (Gao et al., 2024; Lewis et al., 2020; Nabwani, 2023). RAG is particularly eective for applications requiring tailored responses from extensive, context-specic document sets (Nabwani, 2023). Addi- tionally, it proves invaluable when LLMs need to incorporate the latest updates, such as recent news or medical research ndings not included in the initial training data, ensuring that responses remain current and factually accurate. 2.2.2.3 Scaling Laws A key insight in pre-training is that scaling plays a vital role in the ecient de- velopment and deployment of LLMs, making the establishment of a quantitative method to measure the scaling eect invaluable (Zhao et al., 2023). The relation- ship between scale and model performance, also called scaling law, have the ability in most cases to predict a continued increase in certain capabilities as models get larger (Ganguli et al., 2022). Scaling laws enable precise predictions of how future models’ capabilities will evolve as they are scaled up in three dimensions: the volume of data they are trained on, their size in terms of parameters, and the computational eort (measured in FLOPs) utilized in their training (Bowman, 2023). This predictive capacity facilitates key design decisions, such as determining the optimal model size within a xed resource budget, thereby avoiding the need for costly trial and error. For transformer language models, the OpenAI team presented in 2020 (Kaplan et al., 2020) that performance of a language model improves as the parameters model size, data and compute appropriately is scaled up in tandem making the conclusion that larger models perform better. Moreover, the Google DeepMind team presents their investigation of the optimal model size and number of training tokens for a 9 2. Theoretical Framework transformer language model in 2022 (Homan et al., 2022), reaching a slightly dif- ferent conclusion. While both research groups concur on the choice of parameters, Kaplan et al. 
(2020) advocate for allocating a larger portion of the budget to model size over data, whereas Homan et al. (2022) contend that model size and data volume should be scaled up proportionately. Despite diering theories on the ideal balance, there’s unanimous agreement among them and other researchers that the computational and energy costs for training LLMs are considerable. Consequently, precisely identifying the optimal model hyperparameters for a specic computational budget becomes crucial (Homan et al., 2022; Kaplan et al., 2020; Zhao et al., 2023). While pre-trained LLMs deliver remarkable performance across a range of tasks, they are not universally applicable to every specic need. When encountering tasks beyond the capabilities, one solution is to ne-tune the model. As mentioned, ne- tuning involves adjusting the model’s parameters to better suit new, specic objec- tives, aiming to tailor the model for particular tasks (Ganguli et al., 2022). This process typically uses more focused datasets to rene the model’s abilities for specic uses (Hadi et al., 2023). However, just as training LLMs from scratch with exten- sive sequences presents computational hurdles, ne-tuning an already pre-trained LLM can also be a signicant investment (Chen et al., 2024). However, various techniques for specifying LLM to specic application areas has been an attractive area of research. 2.2.3 Challenges of LLMs Probably the most widespread challenges of LLMs today is the concept of hallucina- tions (Kaddour et al., 2023). Hallucinations of LLMs imply factual errors, incorrect or misleading outputs provided by the model (Zhao et al., 2023). This notable constraint frequently stems from the model’s eorts to bridge gaps in knowledge or context by making assumptions grounded in the patterns it absorbed during its training phase (Hadi et al., 2023). As one of the most well-known implications it is also a topic of active research, with hypothesis that the problem is a result of the model’s training process, dataset, and architectural design. Hallucinations sub- stantially limits how LLMs can be responsibly used (Bowman, 2023), especially in certain high-risk elds (Li et al., 2023). However, recent research suggest that a solution to mitigate hallucinations could be achieved in a near future, by the LLMs own internally capability to track which statements are true with reasonably high precision (Bowman, 2023). Another common implication of LLMs is that the models can demonstrate bias, when the training data the model is developed on is biased (Hadi et al., 2023). A statement well aligned with Baeza-Yates (2016) quote “the output quality of any algorithm is a function of the quality of the data that it uses”. Viswanath and Zhang (2023) has listed a detailed quantitative analysis of various biases, including those related to race, gender, ethnicity, age, and more, which has been displayed by well-known LLMs as BERT and GPT-2. Well-known biases like these, related to minorities and disadvantage groups, are all examples of potential harmful con- 10 2. Theoretical Framework tent that need to be avoided (Navigli et al., 2023). To eliminate biases and other toxic content, it is important to lter low-quality text in the data preprocessing step (Liu et al., 2024). This could be done by adopting content moderation techniques, using methods like sentiment analysis, hate speech detection, and using specialized bias identication algorithms. 
The signicant challenge posed by biases in LLMs is crucial to recognize, given the documented mistakes in critical sectors like law, medicine, and recruiting (Navigli et al., 2023). Furthermore, as real-world information evolves, the knowledge embedded in LLMs can become obsolete or inaccurate (Zhao et al., 2023). According to Kaddour et al. (2023), with current methods, refreshing the outdated knowledge in models with new pre-training data proves to be expensive, and the task of removing old data while adding new information during the ne-tuning stage is complex. Thus, it is vital to explore cost-ecient approaches, such as various tuning techniques, to incorporate updated content into LLMs (Zhao et al., 2023). This challenge is cru- cial to manage due to it limiting the applicability of LLMs on real-world use cases, particularly where single outdated information can imply large implications for the end-user (Kaddour et al., 2023). Another large implication of LLMs is the technology’s lack of a human-like under- standing of real-world problems (Ilicki, 2023a), where it is required strong reasoning skills and expert domain knowledge (Liévin et al., 2022). 2.2.4 The Future of LLMs Future LLMs carry high expectations, with predictions pointing towards a swift enhancement in the technology’s learning capabilities and overall performance (Liu et al., 2024). As one of the emerging trends, researchers believe the development will move towards ne-tuning LLMs against specic industries and real-world appli- cations (Liu et al., 2024; Stringhi, 2023). Interestingly, there is a belief among some researchers that LLMs designed for specialized tasks or specic elds may surpass the modeling capabilities of general-purpose LLMs, such as ChatGPT (Raschka, 2023). However, it is then crucial that researchers in the eld of AI will work in close collaboration with various sectors and professionals from diverse elds (Liu et al., 2024). By pooling expertise from various elds, joint endeavors will tackle challenges and explore potential solutions in a unied manner. Furthermore, it is crucial to clearly dene the LLMs’ application scope (Stringhi, 2023) and how to tackle it. For instance, there is a debate among researchers about the appropri- ateness of using LLMs in decision-making processes. Moreover, some researchers emphasize the necessity of human oversight for the activities of LLMs at all times, particularly in sensitive areas such as healthcare and law (Guo et al., 2023; Stringhi, 2023). 11 2. Theoretical Framework 2.3 Ambient AI Scribes Another promising AI tool that could be used in a clinical setting is ambient AI scribes. This technology utilize machine learning to analyze conversations, oering real-time, scribe-like capabilities (Tierney et al., 2024). It employs automatic speech recognition technology to extract and analyze relevant medical information and its context from patient-clinician dialogues. Employing a smartphone microphone, it transcribes encounters as they happen without retaining any audio recordings. This system then automatically generates electronic health record notes in near real-time (Crampton, 2019; Tierney et al., 2024). Thus, this technology has the potential to signicantly reduce the documentation burden for healthcare sta and enhances physician-patient interactions. 
However, to the best knowledge of the researchers conducting this study, there is a limited number of up-to-date scientic articles on this technology’s performance, making it dicult to assess its eectiveness. In Tierney et al. (2024)’s study, the au- thors conclude that despite the early promises of this technology, careful and ongoing attention is necessary to ensure that it supports clinicians eectively. Additionally, optimizing ambient AI scribe outputs for accuracy, relevance, and alignment within the physician–patient relationship is crucial (Tierney et al., 2024). 2.4 LLMs in Healthcare The recent focus on LLMs has caused unprecedented discussion of their potential application in various elds, including healthcare (Reddy, 2023). LLMs have revolu- tionized natural language processing (NLP), and state-of-the-art models like PaLM2 and GPT-4 have gained signicant attention (Thirunavukarasu et al., 2023). LLMs have the potential to improve the eectiveness and eciency of clinical, research and educational work in medicine. LLMs like ChatGPT use are described to have the potential to provide numerous benets both for clinicians and patients (Zheng et al., 2023). Among these are to greatly improve the eciency and accuracy of diagnosis and treatment plans, immediate 24/7 answers on healthcare questions, streamlining administrative tasks like billing and scheduling and support radiology departments with image analysis and interpretation. In one recent study conducted in cooperation between actors is Sweden and Switzer- land, orthopedic discharge summaries and letters were generated from ChatGPT-4 and orthopedic physicians which were evaluated blindly and then compared (Rosen- berg et al., 2024). The study’s result presented that both ChatGPT-4 and physician- generated notes were comparable in quality, but that ChatGPT-4 generated dis- charge documents ten times faster than the physicians did. This was a pilot study consisting of six cases, inviting further research and underlining that it exists techni- cal and legal barriers before ChatGPT-4 could be used in clinical application using patient data, but the result indicates that signicant gains can be realized in terms of time by utilizing LLMs in healthcare. 12 2. Theoretical Framework There are several issues and limitations preventing clinical deployment of LLMs like ChatGPT (Thirunavukarasu et al., 2023). Even though current applications have shown considerable potential in performing human-capable tasks, LLMs have also demonstrated signicant drawbacks like falsifying data and generating misinfor- mation (Reddy, 2023). These are concerning aspects in general, but in the context of healthcare, they can be even more severe. As LLMs are explored for utility in healthcare, including providing medical advice, interpreting health records and gen- erating discharge summaries, it is necessary that safeguards are ensured around their use. Even if AI-enabled tools can not diagnose and treat diseases today, DiGiorgio and Ehrenfeld (2023) argue that LLMs are ready to tackle many administrative tasks. It will require regulatory oversight, but less than for algorithms that recommend treatments, make diagnoses or otherwise impact clinical decision making. The re- searchers argue that the healthcare industry should reach for the low-hanging fruit in improving the eciency for physicians, de-tethering clinicians from the computer. 
Being aware of the risks and limitations to implement LLMs in healthcare, (Ilicki, 2023c) agrees that LLMs should be implemented to free time for physicians through automating documentation and thus, de-tethering clinicians from the computer. 2.5 Ethics and Legislation of LLMs in Healthcare AI’s potential to revolutionize healthcare delivery is widely acknowledged and the development of pre-trained LLMs has opened new possibilities for generating medical content and support clinicians with decision making, diagnosis prediction and treat- ment options (Reddy, 2023). Since the launch of OpenAI’s AI-based conversational LLM ChatGPT in November 2022, research has acknowledged that promising appli- cations of ChatGPT can induce a paradigm shift in health practice Sallam (2023). However, the use of LLMs has raised multiple ethical concerns (Reddy, 2023). According to Harrer (2023), there are three core limitations of LLM-generated data. First, models that have been trained on a large corpus of internet data with limited ltering, like ChatGPT, have ingested as much fair as biased content, as much harm- less materials as harmful ones and as much facts as misinformation. This causes a risk of LLMs reproducing, disseminating or amplifying misinformation or problem- atic content. Second, the models do not know whether the material it produces contains misrepresentations, inappropriate content or falsehood or whether it tells the truth and they have no means to assess this by themselves, nor can they inform the user about this. Third, LLMs are probabilistic algorithms, meaning that when they are prompted multiple times with the same task or question, the model will return dierent responses, which can be dierent versions of previously problematic or wrong answers, or replacement of wrong answers with correct or improved ones and vice versa, or constitute dierent versions of previously correct replies or com- binations on that. This behavior poses a reproducibility and reliability problem, requiring continuous human oversight of the model’s output. 13 2. Theoretical Framework Zhang and Kamel Boulos (2023) describe how ChatGPT’s responses have shown a wide and unpredictable uctuation in quality and veracity, and the authors ar- gue that this unpredictability is the main barrier for successful adoption of LLMs in healthcare. Similar to the argument by Harrer (2023) above, Zhang and Kamel Boulos (2023) explain that users can not know when the model is going to return a good answer and when its answers are going to be misleading or wrong, and thus, can not know when the model can be trusted or not, especially when the user is not qualied enough to assess the quality, the completeness and accuracy, of a given response. Haltaufderheide and Ranisch (2023) support above arguments and state that the most distinctive concern with LLMs in healthcare is the tendency to pro- duce convincingly, but inaccurate content or harmful misinformation. Scientically plausible, but factually inaccurate answers provided by LLMs are a phenomenon called hallucinations (Sallam, 2023). In healthcare, inaccurate algorithms output risks resulting in harm to health through for example misdiagnoses or inappropriate recommendations of treatment (Petročnik et al., 2023). Another paramount concern regarding LLMs in healthcare is that LLMs risks per- petuating harmful racial, cultural and gender biases (Haltaufderheide & Ranisch, 2023). 
In healthcare, these models risk providing biased outputs inadvertently prop- agating social prejudices against already vulnerable groups (Petročnik et al., 2023). GenAI models are prone to various forms of bias depending on how they are trained (Zhang & Kamel Boulos, 2023). Biases are seen as a signicant source of harm (Haltaufderheide & Ranisch, 2023). Further, inputting patient data raises ethical questions regarding privacy, data secu- rity and condentiality, especially in relation to commercial and publicly available models like ChatGPT (Haltaufderheide & Ranisch, 2023). The propensity of LLMs to disseminate patient data or other sensitive health information is a serious privacy concern. Personal health data is protected by the General Data Protection Regula- tion (GDPR) Article 9 (European Union, 2016). Privacy concerns have further been raised in relation to ChatGPT’s training data and code being kept secret, regarding its collection and storage of personal user data to train the model further (Zhang & Kamel Boulos, 2023). A need for transparency is evident, and the EU AI Act includes transparency requirements for GenAI, in- cluding a need that summaries of copyrighted data used for training is published. Transparency, together with explainability, is especially important in healthcare for operators to be able to validate LLMs performances (Harrer, 2023). The “black box" nature of generative AI models raises questions about the interpretability of the output they generate and calls for explainability and transparency (Zhang & Kamel Boulos, 2023). In the EU AI Act by the European Union (2024), it is stated that AI systems that proles individuals are always considered high-risk, that is, systems that pro- cess personal data to assess various aspects of a person’s life automatically, such 14 2. Theoretical Framework as health, economic situation, location or movement, etc. Researchers highlight that the use of ChatGPT in healthcare should be conducted with extreme caution considering its potential limitations (Sallam, 2023). Many further argue that LLM applications in healthcare should require close human oversight (Haltaufderheide & Ranisch, 2023). World Health Organization (2021) has identied a number of ethical principles for the responsible design and application of AI technology: protect autonomy; pro- mote human safety and well-being and the public interest; ensure transparency, explainability and intelligibility; foster accountability and responsibility, ensure eq- uity and inclusiveness; and promote AI that is sustainable and responsive. Referring to these principles by World Health Organization (2021), Harrer (2023) argues that ethical and legal frameworks are needed for deployment and use of AI applications and for the selection and management of training data. There is also a need for measures to mitigate model biases. The author highlights six key factors that need to be regarded for responsible ethical design, use and governance of AI in healthcare. 1. Accountability: Users need to be informed about the capabilities and risks of the technology as well as the sensitivities and responsibilities involved in using it. The current legal state is blurry and there is a need for legal clar- ity regarding accountability, responsibilities and rights of developers and users. 2. 
2. Fairness: General LLMs like ChatGPT that have been trained on data from the internet face the critical issue of bias, and how to make an AI system unlearn problematic content is a complex research question. Ethics panels currently need to review model implementations and audit the performance of already deployed models to identify and eliminate sources of misinformation and bias.

3. Data privacy and selection: Electronic health records range among the most sensitive and highly restricted data sources and need to be treated as such. The healthcare sector is an evidence-driven domain, and the act of choosing and accessing suitable training data to develop generative AI applications has ethical, legal and model-performance-related implications.

4. Transparency: Prompts are bridged with responses by LLMs that lack the inherent capability of showing the logic behind their work, leaving human operators with this task; operators will only be able to do a reliable job if the models provide insight into their data sources. Camouflaged generative AI content delivered through automated chatbots could become a dangerous source of misinformation, risking contamination of knowledge bases at scale. Labeling AI-generated content could therefore be needed.

5. Explainability: For generative AI systems, explainability should be a key design feature, as it provides an important data point to the human operator and user whose role it is to validate the soundness and correctness of AI-generated content before it is translated into action. Explainable and transparent AI systems could be assigned a truthfulness index, which could serve as a measurement for assessing the trustworthiness of an LLM when assisting clinicians and patients.

6. Value and purpose alignment: AI's so-called alignment problem describes the ethical and existential risks that emerge when AI machines violate or do not follow the purpose and values of their human creators and users. To avoid this, a human value system needs to be in place, as well as a clearly defined system of what the AI system should do and why.

2.5.1 Medical Device

Some medical products in Sweden fall under the legal framework of medical devices (Swedish Medical Products Agency, 2021). Below is part of the definition of a medical device quoted from the Swedish Medical Products Agency (2021).

"A medical device can be an instrument, apparatus, appliance, software, implant, reagent, material or other article. The condition is that a medical device must be used on humans and have one or more of the following medical purposes:
• diagnosis, prevention, monitoring, prediction, prognosis, treatment or alleviation of disease
• diagnosis, monitoring, treatment, alleviation of, or compensation for, an injury or disability
• investigation, replacement or modification of the anatomy or of a physiological or pathological process or state
• providing information by means of in vitro examination of specimens derived from the human body, including organ, blood and tissue donations."

2.6 Healthcare in Sweden

In a study by Anskär et al. (2019), the authors highlighted a significant concern about the allocation of healthcare resources in Sweden. At that time, Sweden had a relatively higher number of healthcare staff per capita compared to most other European countries.
Despite these resources, the healthcare system in Sweden struggled with issues such as limited access and prolonged waiting times for diagnosis and treatment in both hospitals and primary care facilities (Anskär et al., 2019). Fast forward to today, researchers forecast a severe staff shortage in the Swedish healthcare sector, occurring simultaneously with an aging population (Barkman & Aasa, 2023). Given the aging population, the demand for healthcare will also increase (Ahlstedt et al., 2023). Additionally, Grant Thornton (2023) notes that care queues will continue to grow significantly each year. In 2023, the healthcare sector continued to face challenges in providing patients with their first consultation, surgery, or other scheduled treatments within the timelines specified by the national care guarantee (Grant Thornton, 2023; Janlöv et al., 2023). This scenario indicates that fewer healthcare workers will need to manage increasing demands, even as the system fails to resolve the previously mentioned critical issues, underlining the pressing need for significant improvements in efficiency.

Currently, there is an ongoing debate concerning the level of administration in Swedish healthcare and its impact on the care provided to patients (Ilicki, 2023b). Given the increasing challenges faced by the healthcare sector, a greater emphasis on planning and optimizing healthcare resources will be necessary (McKinsey & Company, 2019). However, to discuss the potential for reducing the administrative burden within healthcare effectively, it is essential to first define what is meant by 'administrative work' and what is meant by 'clinical work'.

2.6.1 Definition of Administrative versus Clinical Work in Healthcare Settings

Clinical work is related to the observation and diagnosis of patients (Strauss et al., 1985). Holten Møller and Vikkelsø (2012) describe clinical work as a circular process involving five activities: (a) assessment of the patient's health status and illness, (b) acquisition and analysis of clinical data, (c) making informed decisions regarding the patient's health and illness, (d) providing further medical treatment for the patient's condition and disease, and (e) observing the outcomes of the treatment. These work tasks are typically associated with direct patient care, which is generally regarded as more meaningful compared to other tasks performed by clinicians (Bringsén et al., 2012). Another way of phrasing clinical work is 'direct patient-related care', defined as work in direct interaction with patients as well as phone communication with patients or their close relatives (Anskär et al., 2018).

Broadly, administrative work is defined by Anskär et al. (2019) as communicating information between various professional roles, coordinating activities, and structuring systems, pointing out that information is one key word of administration. However, the word "administration" can have different meanings in different contexts, which might explain why there are so many views on its scope, importance, and even the emotional connotations it carries (Holberg & Bell, 2015). In Swedish healthcare, there is a lack of a sufficiently established definition of what is meant by 'administration' (Holberg & Bell, 2015). Occasionally, the term 'administration' covers activities indirectly related to patient care, such as managing health records and processing referrals.
On the other hand, some reports specifically address tasks that are clearly administrative, like billing and scheduling. On this topic, Anskär et al. (2019) distinguish between patient-related and organization-related administrative work tasks.

The nature of patient-related work tasks typically encompasses activities such as documentation, dictation, scheduling appointments, managing health records and referrals, as well as inputting data into healthcare records and quality registries, to provide a few examples (Anskär et al., 2018). This "patient-related administration" could be defined as the administrative tasks that arise during a patient's care episode, where every stage of healthcare requires or necessitates some form of documentation and registration (Holberg & Bell, 2015).

Meanwhile, organization-related administrative tasks involve, for example, overseeing equipment and facilities, handling emails, organizing schedules, and procuring medical supplies, including items like laundry (Anskär et al., 2018, 2019). These tasks are part of an administrative process which is continuous, yet does not have a direct connection to the patient (Holberg & Bell, 2015).

Distinguishing between clinical and administrative work can be challenging, as these tasks often overlap and intertwine (Holberg & Bell, 2015). Also, given the substantial volume of administrative duties in Swedish healthcare, it might be beneficial to categorize these tasks into those related to patient care and those associated with organizational management (Anskär et al., 2018), for a clearer structure and overview of clinicians' everyday work. The definitions used in this report are presented in the list below. A selection of examples of the subcategories is presented in Table 2.1.

• Direct patient work: Tasks that involve direct physical interactions with patients as well as phone communications with patients or their close relatives.
• Patient-related administration: Administrative tasks that arise during a patient's care episode, which are directly linked to the patient's care.
• Non-patient-related administration: Administrative tasks that do not have a direct connection to patients.

Table 2.1: Examples of direct patient work, patient-related administration, and non-patient-related administration.
Direct patient work: face-to-face contact with patients; telephone contact with patients; telephone contact with patients' next of kin.
Patient-related administration: documentation of healthcare records; reading EHRs; signing journal entries; referral management; prescribing medical drugs; entering data into health records and quality registers.
Non-patient-related administration: meetings (non-patient-related); managing equipment and facilities; e-mail management (non-patient-related); scheduling; managing computer problems; ordering medical supplies (such as laundry).

2.6.2 Previous Studies on Time Utilization in Healthcare Settings

As in many other countries, patient care at Swedish hospitals is a collaborative effort, engaging various clinicians who rely on each other's expertise and thus must work together closely (Bardram, 1997). The collaborative effort consists of various tasks of different character. Barkman and Aasa (2023) contend that there has been a significant rise in administrative duties in recent years.
Numerous studies indicate that physicians spend one-third of their time on direct patient-related activities, whereas nurses allocate approximately 43 % of their time to such tasks (Barkman & Aasa, 2023; Grant Thornton, 2023).

Anskär (2019) investigated time utilization by healthcare staff in Swedish primary care. The study collected data using self-reporting forms, where participants both estimated and then documented in real time the time dedicated to tasks categorized as direct patient work, patient-related administration, and other work tasks. According to the study, registered nurses reported spending 42.2 % of their time on direct patient care, 27.2 % on patient-related administration, and 30.6 % on other tasks. For physicians, the figures were 34.4 % for direct patient care, 32.3 % for patient-related administration, and 33.3 % for other tasks. Interestingly, both groups initially overestimated the time they would spend on direct patient care and underestimated the time for other tasks, as compared to their actual logged hours (Anskär, 2019).

For physicians, the most time-consuming patient-related administration tasks were dictation (24.0 %), reading health records (16.8 %), signing documentation (13.0 %), and documentation of health records and ordering of tests (11.9 %). For nurses, the most time-consuming were documentation of health records and ordering of tests (51.6 %), reading health records (13.3 %), contact with other caregivers about patient cases (9.6 %), and administering appointments (6.8 %).

Another study, conducted by McKinsey & Company and the Swedish Medical Association (Läkarförbundet) in 2018, found that Swedish physicians spent 7.5 hours per week on patient-related administrative tasks, which amounts to one full day each week dedicated to these activities (McKinsey & Company, 2019). Furthermore, a study from 2009 revealed that district nurses in Sweden allocated 30.47 % of their time to indirect patient care (Borelius, 2009).

2.6.3 Perspectives on the Administrative Burden

Views on the administrative burden vary widely, with some researchers asserting that all administration is necessary, while others strongly contend that unnecessary administration exists (Effektiv vård, 2016). Ilicki (2023b) highlights a nuanced perspective on administrative tasks in healthcare, pointing out that while some administrative work has direct medical value, it is often too time-consuming. For instance, Ilicki (2023b) notes that tasks like writing discharge notes for patients whom the physicians have never met, and reporting information to quality registers, are valuable for researchers and patients but not for the clinicians who perform these tasks. Additionally, Ilicki (2023b) emphasizes the inefficiency of the login processes to clinician data systems, which, while time-consuming, also play a role in ensuring computer security.

Regardless of opinions on whether the administration is unnecessary, it is clear that clinicians in Sweden face a significant administrative workload and that excessive administrative tasks divert time from essential medical activities (Barkman & Aasa, 2023). Additionally, there has been a gradual shift of administrative responsibilities from administrative personnel to medical staff, suggesting that the deployment of medical resources may not be optimized for efficiency (Anskär et al., 2019).
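To illustrate what these layered percentages can mean in absolute terms, the back-of-envelope sketch below is not taken from any of the cited studies; it assumes a 40-hour working week and that the task percentages from Anskär (2019) are shares of the patient-related administration time, both of which are assumptions made purely for illustration.

# Illustrative back-of-envelope calculation (not from the cited studies):
# translating the reported percentages into weekly hours for a physician.
WEEK_HOURS = 40.0              # assumed working week (illustrative)
ADMIN_SHARE = 0.323            # physicians' patient-related administration (Anskär, 2019)
TASK_SHARES = {                # assumed to be shares of patient-related administration time
    "dictation": 0.240,
    "reading health records": 0.168,
    "signing documentation": 0.130,
    "documentation and ordering of tests": 0.119,
}

admin_hours = WEEK_HOURS * ADMIN_SHARE
print(f"Patient-related administration: {admin_hours:.1f} h/week")
for task, share in TASK_SHARES.items():
    print(f"  {task}: {admin_hours * share:.1f} h/week")

Under those assumptions, a physician would spend roughly 13 hours per week on patient-related administration, of which dictation alone would account for about 3 hours; the figures should be read only as an illustration of how the percentage shares compose, not as results of this study.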
2.6.4 Illegitimate Tasks: Conceptualization and Implications for Employees in the Healthcare Sector

Illegitimate tasks encompass work assignments that are perceived by employees as falling outside their role responsibilities, thus violating normative expectations of their professional duties (Björk et al., 2013; Semmer et al., 2015). Such tasks increase the amount of administrative duties (Basinska & Dåderman, 2023). For instance, instead of delivering direct medical care, employees may be tasked with duplicative documentation, generating various reports and summaries. According to the Stress-as-Offense-to-Self (SOS) theory, developed by Semmer et al. (2007), illegitimate tasks have been identified as a significant social stressor, endangering one's sense of self and infringing upon one's professional identity (Semmer et al., 2019). In licensed professions, such as in healthcare, there is a strong link between an individual's identity and their professional role (Aronsson et al., 2012). In these professions, stress often arises from conflicts between one's professional role, personal identity, and self-evaluation. The professional role defines and suggests what is expected, as well as what is not expected under typical conditions, including tasks that are deemed illegitimate.

Illegitimate tasks can be divided into two categories: unnecessary tasks and unreasonable tasks (Semmer et al., 2015). Unnecessary tasks are tasks that seem to lack a clear purpose or benefit and often stem from organizational inefficiencies, outdated ways of working or idiosyncratic demands (Kilponen et al., 2021). Employees may view tasks as unnecessary if they are fundamentally avoidable or if better organization could have eliminated the need for them altogether (Kilponen et al., 2021; Semmer et al., 2019). For instance, having to re-enter data due to incompatible computer systems is seen as unnecessary because it could have been prevented if the systems had been compatible from the start (Semmer et al., 2015). In healthcare settings, these characteristics may be linked to the notorious need for double and triple documentation caused by incompatible data systems (Barkman & Aasa, 2023).

Unreasonable tasks, on the other hand, are considered inappropriate for the person assigned to them, either because they do not align with their skills, are deemed a waste of time, or are tasks that should be performed by someone else (Björk et al., 2013; Semmer et al., 2015). This phenomenon is particularly prevalent in Sweden's healthcare sector. According to McKinsey & Company (2019), one significant factor contributing to low productivity within Swedish healthcare is that physicians devote a substantial portion of their time to tasks that should be performed by others.

Whether a task is classified as unnecessary or unreasonable, it signals a lack of respect for the person who is expected to do it (Semmer et al., 2015). However, it is crucial to recognize that what is considered illegitimate in the workplace can vary depending on the context. Understanding behaviors that might be viewed as illegitimate requires considering how individuals perceive their job responsibilities and the professional identity linked to their role (Semmer et al., 2009). Thus, the same task could be considered illegitimate for one person but not for another (Semmer et al., 2015).
The valuation of tasks varies based on whether they are central to the profession, peripheral, or completely outside the role (Aronsson et al., 2012). Tasks aligning with the expectations of the role holder are seen as legitimate, while those that exceed these norms are viewed as illegitimate. However, peripheral tasks are not inherently viewed as illegitimate, although they are more prone to this perception compared to core tasks (Semmer et al., 2015). These tasks can be seen as legitimate if they support, rather than obstruct, primary activities. However, according to the article by Barkman and Aasa (2023) on healthcare administration, the primary source of frustration among Swedish clinicians is not the tasks themselves, but the manner in which they are executed. This suggests a connection to unnecessary illegitimate tasks, especially since the frustration is often related to data systems and organizational processes (Barkman & Aasa, 2023; Janlöv et al., 2023; McKinsey & Company, 2019).

The situation in which a task occurs also influences legitimacy (Semmer et al., 2015). For instance, if a colleague is ill, it might necessitate another to temporarily take on duties beyond their usual role, which is typically not viewed as illegitimate. However, the administrative tasks that burden clinicians are not caused by unusual situations (Barkman & Aasa, 2023). The official report by the Swedish Government on healthcare in 2016, Effektiv vård (2016), highlights that over the past decade, legislative modifications have predominantly augmented procedural and administrative regulations within healthcare settings in Sweden. Moreover, this accumulation of requirements persists, with new regulations being introduced regularly without any removal of obsolete ones, a situation unchanged to this day (Barkman & Aasa, 2023).

2.6.4.1 Illegitimate Tasks' Impact on Well-Being

Effectively assigning tasks can not only liberate clinicians to concentrate more on patient care but also boost job satisfaction, optimize skills, and cultivate a more efficient workforce (McKinsey & Company, 2019). Previous studies suggest that illegitimate tasks often cause stress for clinicians (Anskär et al., 2019; Thun et al., 2018) and have a detrimental effect on their well-being (Björk et al., 2013; Semmer et al., 2015; Thun et al., 2018). For example, illegitimate tasks have been found to correlate with adverse health and emotional consequences, such as diminished self-esteem (Eatough et al., 2016; Schulte-Brauck et al., 2019; Semmer et al., 2015; Sonnentag & Lischetz, 2018), decreased sleep quality (Semmer et al., 2015) and an elevated risk of burnout (Semmer et al., 2019; Werdecker & Esch, 2021). Furthermore, research indicates that illegitimate tasks can result in heightened cortisol release, greater occurrence of musculoskeletal pain (Kottwitz et al., 2013) and heightened feelings of anger and frustration (Eatough et al., 2016). Additionally, studies have found that a higher degree of illegitimate tasks leads to decreased job satisfaction (Björk et al., 2013; Werdecker & Esch, 2021).

3 Methods

The objective of this chapter is to describe and discuss the research methods that have been applied in this study. The study is based on 37 interviews with healthcare professionals at the neurology, ophthalmology and radiology departments at Sahlgrenska University Hospital, as well as 9 interviews with 10 experts within the field of AI in healthcare from Sweden and Canada.
Some data were collected via self-completion forms and field notes from observations at the hospital.

3.1 Research Process

Initially, data was gathered for constructing the theoretical framework and to facilitate learning. A literature review was conducted about LLMs in healthcare, encompassing theory about the development of AI in healthcare, the technology of LLMs, the ethics of LLMs in healthcare, and the definition of and perspectives on administrative work in healthcare.

Furthermore, semi-structured interviews were conducted with clinicians from the neurology, ophthalmology and radiology departments at Sahlgrenska University Hospital. These interviews were focused on identifying the clinicians' needs regarding their administrative text-based tasks. The semi-structured interviews were complemented with self-completion forms that aimed to collect numerical data on how much time clinicians spend on a selection of administrative tasks. The selection of tasks was derived from the needs identified in the interviews with clinicians. Some data were collected as field notes from a number of observations the researchers conducted to gain an understanding of the clinical setting at the hospital. In addition, expert interviews were conducted with individuals who work in the field of AI in healthcare to gain deeper insights and learnings. The collection of expert interviews represents a small-scale environmental scan, which provides knowledge and inspiration from actors in Sweden and Canada on the topic of LLMs and AI in healthcare.

Lastly, the collected data were analyzed through thematic analysis in relation to the theoretical framework. The objective of the discussion was to evaluate both how LLMs can be used in healthcare in general and how LLMs can be used to streamline clinicians' text-based tasks at the neurology, ophthalmology and radiology departments at Sahlgrenska University Hospital. The potential of LLMs in healthcare was discussed, as well as the barriers. Figure 3.1 below visualizes the methods that have been used for data collection, sampling and data analysis in this study.

Figure 3.1: Data collection, sampling and data analysis methods.

3.2 Data Collection

This study relies on both primary and secondary data. Primary data were gathered through 37 semi-structured interviews with healthcare professionals at Sahlgrenska University Hospital (see Table 3.1) and through 9 interviews with 10 experts in the field of AI in healthcare (see Table 3.2). Some primary data were collected via self-completion forms and field notes from observations. Secondary data were collected as a complement to the primary data and were derived from an extensive literature review of reputable sources. The secondary data form the basis of the theoretical framework of this study. The initial data collection included a broad spectrum of perspectives and was followed by a more focused data collection, refined in line with emerging results.

3.2.1 Literature Review

A literature review was conducted in order to collect theory relevant to this research. According to Easterby-Smith et al. (2015), a literature review describes, clarifies and evaluates what is already known about the topic. It is also a valuable tool for the authors to achieve in-depth learning about the topic early on in the process.
This research requires a multifaceted knowledge base, including technical understanding of LLMs as well as knowledge of the complexities in healthcare, where ethical, legal and organizational aspects are intertwined. Relevant literature was retrieved from public domains related to healthcare and medical technology, and also through the use of keywords in databases such as Google Scholar and ResearchGate.

The challenge of doing a literature review is determining what information is trustworthy and relevant to the research (Eriksson & Wiedersheim-Paul, 2014). Bell et al. (2019) support this argument and suggest four criteria for quality assessment: authenticity, credibility, representativeness and meaning. These four criteria are fundamental for this research when retrieving data from existing literature.

3.2.2 Qualitative Interview

Interviews have been a central part of this research to collect qualitative in-depth data. According to Bell et al. (2019), there are two main types of interviews: the unstructured interview and the semi-structured interview. The term "qualitative interview" is sometimes used to encapsulate these two types of interviews. In an unstructured interview, the interviewee is asked very few questions and is allowed to respond freely, with the interviewer responding in turn only to points that seem worth following up on (Bell et al., 2019). Unstructured interviews tend to be similar to conversations. In a semi-structured interview, the interviewee still has a great deal of leeway in how to respond, but the interview is somewhat structured to keep within fairly specific topics that the interviewer is prepared to cover with the help of a list of questions (an interview guide) (Bell et al., 2019).

In this study, all 46 interviews have been semi-structured. Two interview guides have been used, one for interviews with clinicians and one for interviews with experts. All interviews showed large variations, not least between clinical departments, making each interview unique. The two interview guides are presented in Appendix 1 (see sections A.1 and A.2). The clinician interviews were approximately 30 minutes long and the expert interviews one hour, with the exception of two expert interviews that were 30 minutes long.

3.2.3 Sampling and Practicalities

All interviewees were selected through purposive sampling. In qualitative research, most sampling entails purposive sampling of some kind (Bell et al., 2019). Purposive sampling implies that the sampling is conducted with reference to the goals of the research. This means that the units of analysis are selected according to criteria that allow the research questions to be answered.

The clinicians from Sahlgrenska University Hospital who were interviewed were sampled through generic purposive sampling. In generic purposive sampling, the researcher defines criteria, identifies suitable cases, and samples from them to address the research questions (Bell et al., 2019). The defined criterion in this study was initially healthcare professionals in general; it was later narrowed down to physicians and nurses. Participating healthcare professionals were recruited with the help of the three clients of this study, who themselves work as residents at SU at the three departments that this study focuses on. The interviewed field experts were sampled through snowball sampling.
Snowball sampling involves the researcher initially engaging with a small group of individuals pertinent to the research topic and subsequently leveraging these connections to establish contacts with others (Bell et al., 2019). Initially, experts were sampled with the help of the supervisor of this study. Experts who participated in interviews in turn referred the researchers to other experts, who were then interviewed.

At the beginning of each interview, the researchers asked for permission to record. All interviews were recorded and later transcribed with the help of Turboscribe and Microsoft Office Word. Both researchers participated in all interviews, with a few exceptions due to practical reasons. The majority of the interviews were conducted via Microsoft Teams; only a few were conducted on site at SU. The two researchers had clear roles throughout the data collection of this study: one researcher was responsible for conducting the interviews, and the other was responsible for taking notes during the interviews.

All clinicians who participated in interviews for this study are presented anonymously in the report, while all experts are mentioned by name. All experts who participated in interviews were given the opportunity to remain anonymous. The ten experts received a draft of the final results text with yellow markings of what they had said and were given the opportunity to delete or change the content of the text. In each draft, the other experts were anonymized and marked with an X while the researchers awaited approval. All experts approved being mentioned by name in the report. The report's results from the clinician interviews were reviewed and approved by the clients of this study for each department.

3.2.4 List of Interviewees

The clinician and expert interviewees are listed in Tables 3.1 and 3.2 below. As mentioned above, the scope of this study was narrowed down from healthcare professionals to physicians and nurses. However, as Table 3.1 shows, a few interviews were conducted with other healthcare professionals at SU as well. That material is not included in the results of this study because of the study's delimitations, which were made due to time and resource constraints. However, the interviews in question were of high value for the researchers' learning and understanding of the clinicians' work at SU. The material from the physiotherapist and the nursing assistants showed similarities mainly with the nurses' administrative text-based tasks, indicating that the results of this study could be generalized to clinicians other than physicians and nurses.

Table 3.1: List of clinician interviews.
Neurology: Nurse (8), Nursing assistant (1), Physiotherapist (1), Resident (4), Senior specialist (4)
Ophthalmology: Resident (9), Senior specialist (1)
Radiology: Medical secretary (1), Radiology nurse (2), Radiology nursing assistant (1), Resident radiologist (1), Senior radiologist (4)

Table 3.2: List of expert interviews.
Amrit Krishnan, Vector Institute: Technical Team Lead
Carolyn Chong, Vector Institute: Senior Product Manager
Derek Beaton, Unity Health Toronto: Director, Advanced Analytics
Hans van den Brink, Västra Götalandsregionen: Digital Strategist
Isak Barbopoulos, Sahlgrenska University Hospital: Machine Learning Engineer
Jonathan Ilicki, Industrifonden: Principal
Magnus Kjellberg, Sahlgrenska University Hospital: Head of Kompetenscentrum AI & Chief Data Scientist
Markus Lingman, Region Halland: Chief Strategy Officer & Senior Consulting Cardiologist & AI Swede of the Year 2020
Michael Page, Unity Health Toronto: Head of AI Commercialization
Rosita Ahlstedt, Region Halland: Department Manager Decision Support and AI

3.2.5 Diary as a Self-Completion Questionnaire

An effective alternative to traditional self-completion questionnaires is a diary, particularly useful for precise estimates of various behaviors. Bell et al. (2019) identified two types of diaries, structured and free-text, where the structured diary resembles a questionnaire with closed questions and is commonly known as a time-use diary. Thus, this kind of diary as a self-completion questionnaire is an effective method for collecting data when fairly precise estimations of time spent on a certain behavior are required (Bell et al., 2019). Often, for objectives where the sequencing of a behavior is required, it offers more accurate results than interviews or estimation questionnaires.

In this study, five structured diaries as self-completion questionnaires were developed: one for radiologists, one for physicians at the ophthalmology department, one for nurses at the neurology department and one for physicians at the neurology department. Additionally, a fifth diary was developed for the neurology physicians after receiving feedback on the first. However, due to time and resource constraints, only the diary for neurology physicians received valid responses. Although the number of responses was too small to be representative of the entire population, they were still considered useful for this study's results.

The diary as a self-completion questionnaire that was distributed among clinicians at the hospital to measure the time they spend on certain administrative text-based tasks is called 'self-completion form' throughout this study.

3.2.5.1 Sampling for the Diary as a Self-Completion Questionnaire

For the diary as a self-completion questionnaire, generic purposive sampling was conducted. This method involves defining criteria and selecting samples based on them to address the research questions (Bell et al., 2019). Similar to the clinician interviews, the criteria for this sample included physicians and nurses from the neurology department, radiologists from the radiology department, and physicians from the ophthalmology department.

3.2.6 Observations for Understanding and Learning

In order for the researchers to gain a more comprehensive understanding of the clinical setting, a number of observations were conducted at Sahlgrenska University Hospital. One day was spent at the neurosurgery department observing two neurosurgical procedures, one day was spent observing a senior specialist at the stroke department, and one half day was spent observing radiologists at the radiology department. Bell et al. (2019) explain that shadowing is a form of observation that has affinities with the notion of passive participant observation.
The researchers follow a member of an organization throughout his or her working day and can ask the observed participant questions about what he or she is doing. The researchers may write field notes, recording times, conversation subjects, mood and body language. During the observations conducted in this study, the researchers asked questions to learn about the clinicians' work, and some field notes were collected during these occasions.

3.3 Data Analysis

A thematic analysis was conducted to analyze the data obtained from the semi-structured interviews. According to Bell et al. (2019), a thematic analysis implies focusing on identifying concluding themes. The three cases from neurology, ophthalmology and radiology, and the expert interviews, were all handled as separate cases, where the data categories were generated by applying a general framework for qualitative analysis proposed by Grodal et al. (2021). The proposed framework follows three steps: 1. generating initial categories, 2. refining tentative categories, and 3. stabilizing categories (Grodal et al., 2021).

During the first stage, generating initial categories, two methods were used: asking questions and focusing on puzzles. The questions asked were predefined and related to this study's research questions. The interview questions for clinicians centered around three main themes: patient-related administrative text-based tasks, primary bottlenecks, and suggested efficiency improvements. For the interviewed experts, the interview questions also centered around three themes: LLMs' potential in healthcare, risk evaluation, and implementation strategies. By asking questions, initial (possible) categories were crafted during the collection and analysis of data, allowing for proactivity, which is recommended by Grodal et al. (2021). Moreover, in formulating the initial data categories from the responses to the interview questions, emphasis was placed on discerning patterns and isolating the most pertinent data for addressing the research questions of the study. This approach involved selectively prioritizing relevant data to navigate the complexity and volume of information.

Secondly, the initial data categories underwent analysis and restructuring based on connections between them. Certain categories were subdivided to enhance clarity and differentiation. For instance, risk evaluation was divided into technological risks and ethics and legislation, while patient-related tasks were divided into patient-related tasks where information is retrieved from EHRs and patient-related tasks where it is not. Categories were then organized sequentially, acknowledging dynamic relationships. This method facilitated the discovery of novel interconnections between concepts (Grodal et al., 2021).

Third, in the concluding phase of the analysis, the findings underwent a reevaluation, incorporating insights from the theoretical framework and the other cases. This involved a comparison of the results with relevant theories to discern supporting and conflicting perspectives, facilitating the formulation of conclusions. This stabilizing stage often assists researchers in offering answers to or clarifying their initial questions or puzzles (Grodal et al., 2021).
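As a schematic illustration of the category refinement described above, the sketch below is not the tool used in this study (the analysis was carried out manually following Grodal et al. (2021)); the excerpts and keyword rules are invented, and it merely shows one way the subdivided categories, such as technological risks versus ethics and legislation, could be represented.

# Illustrative sketch only: a simplified, invented representation of the
# three-step categorization; the excerpts and rules are not real study data.
from collections import defaultdict

# Step 1: initial categories with invented, coded interview excerpts.
initial = {
    "risk evaluation": [
        ("expert", "hallucinated drug doses are a safety risk"),
        ("expert", "GDPR limits which models we can use"),
    ],
    "patient-related tasks": [
        ("clinician", "summarizing the EHR before rounds"),
        ("clinician", "writing certificates from dictation"),
    ],
}

# Step 2: refine tentative categories by subdividing them (toy keyword rules).
def refine(category, excerpt):
    if category == "risk evaluation":
        return "ethics and legislation" if "GDPR" in excerpt else "technological risks"
    if category == "patient-related tasks":
        return "retrieved from EHRs" if "EHR" in excerpt else "not retrieved from EHRs"
    return category

refined = defaultdict(list)
for category, excerpts in initial.items():
    for role, excerpt in excerpts:
        refined[refine(category, excerpt)].append(excerpt)

# Step 3: the stabilized categories are then compared against the theoretical framework.
for category, excerpts in refined.items():
    print(category, "->", excerpts)

In practice, the refinement was an interpretive judgment rather than a keyword match; the sketch only mirrors the structure of the resulting categories.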
3.4 Research Quality

To review the quality of a qualitative research study, Bell et al. (2019) point out credibility, transferability, dependability and confirmability as important criteria that need to be considered.

Credibility refers to the confidence in the truth of the data and the interpretations of them (Bell et al., 2019). It involves ensuring that the research findings accurately represent the participants' perspectives. Techniques to enhance credibility include, for instance, prolonged engagement with participants, persistent observation, and respondent validation. Hence, after compiling the interviews and results for this study, expert interviewees were notified that they could either confirm the information, change it, or reject it. The results from the clinicians' interviews and the observations were sent to the three clients from Sahlgrenska University Hospital for approval of the included information.

Transferability addresses the extent to which the study's findings can be applied to other contexts (Bell et al., 2019). Qualitative research does not aim for broad generalization but seeks to provide rich, detailed insights that others can relate to their own situations. Providing thorough descriptions of the research context, participants, and methodology allows others to judge the applicability of the findings to different settings. Therefore, by utilizing the interview guides provided in Appendix 1, A.1 and A.2, as part of the data collection tools, other researchers can replicate the study's concepts, thereby enhancing the transferability of the findings.

Dependability is concerned with the consistency and stability of the research process if it were repeated (Bell et al., 2019). To ensure dependability, researchers maintain an audit trail documenting all phases of the study, from data collection to decision-making processes. Furthermore, the problem formulation, selection of research participants, fieldwork notes, interview transcripts and data analysis decisions should be documented to ensure dependability. Throughout this study, the researchers aimed to ensure transparency in problem formulation, participant selection, data collection, and data analysis methods. All interviews were transcribed, and all field notes were thoroughly documented.

Confirmability ensures that the findings are shaped by the participants' responses rather than researcher bias (Bell et al., 2019). Researchers can maintain objectivity by keeping a reflexive journal to document their reflections, biases, and decision-making processes. To mitigate potential biases in this study, some measures were implemented. The inclusion of a large number of interviewees from Sahlgrenska University Hospital potentially reduced bias a