Challenges in Specifying Safety-Critical Systems with AI-Components
Master’s Thesis in Computer Science and Engineering
ISWARYA MALLESWARAN & SHRUTHI DINAKARAN
Department of Computer Science and Engineering
Software Engineering Division
Chalmers University of Technology
University of Gothenburg
Gothenburg, Sweden 2022

© ISWARYA MALLESWARAN & SHRUTHI DINAKARAN, 2022.
Supervisors: ERIC KNAUSS & HANS-MARTIN HEYN, Department of Computer Science and Engineering
Examiner: ROBERT FELDT, Department of Computer Science and Engineering
Master’s Thesis 2022
Department of Computer Science and Engineering, Software Engineering Division
Chalmers University of Technology, University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000
Typeset in LaTeX, template by Kyriaki Antoniadou-Plytaria
Printed by Chalmers Reproservice
Gothenburg, Sweden 2022

Abstract

Safety is an important property in the automotive industry. Safety-critical systems such as Advanced Driver Assistance Systems (ADAS) and Autonomous Driving (AD) follow certain processes and procedures in order to perform the desired function safely. Many ADAS applications rely significantly on Machine Learning and on the data needed to perform the desired function. Data quality, more specifically the information content of the data, can strongly impact the effectiveness of the model and its function. It is important to select the right data to train the model.
Furthermore, monitoring the safety-critical system during runtime helps to understand the data that the model receives. Such information further helps to create and update the machine learning model. There are uncertainties and challenges in defining the requirements for finding the right information content of the data such that the desired and safe behaviour of the system is ensured. This case study investigates and explores the challenges experienced in creating the requirements for proper selection of training data. It also analyzes challenges when specifying runtime monitoring and the relation between requirements on runtime monitoring and the training data. This case study follows the approach of qualitative and exploratory research. The analysis for this study is based on ten interviews with experts from different fields. Moreover, a workshop was conducted with academic and industry experts to validate the results from our interview analysis. Based on the qualitative analysis of the data, the case study shows that there is a lack of clarity in defining requirements, a lack of communication, no clear scope of the design domain, missing guidelines for data selection and safety requirements, and a lack of metrics for defining the right variety of data and runtime monitors. The results outline challenges experienced by practitioners when specifying data and defining requirements for runtime monitors for safety-critical systems.

Keywords: Software engineering, Requirement engineering, Specification, Safety, Computer Science, Engineering, Machine learning, Deep learning, Runtime monitor, Data Selection, Data Collection.

Acknowledgements

We would like to extend our gratitude to our supervisors Hans-Martin Heyn and Eric Knauss for their timely support and guidance throughout the study. Their thoughtful feedback in all the discussions we have had proved to be useful and effective.
From the industrial side at Veoneer AB, we would like to thank Stefan Andersson, Olof Eriksson and Oliver Brunnegård for supporting us with the required participants and discussion for the study. A special note of thanks to our examiner Robert Feldt for taking the time to review our thesis and providing constructive feedback on it. We would like to acknowledge all the participants in our interviews and workshop for their time and the responses they provided.

I, Iswarya, owe thanks to my husband, my kid, my family, my friends and my thesis partner for their strong motivation and support throughout the thesis.

I, Shruthi, owe my gratitude to my family and friends for their love, support and encouragement throughout the thesis. I would also like to thank my friend and thesis partner Iswarya for her support and great collaboration.

Iswarya Malleswaran, Shruthi Dinakaran
Gothenburg, October 2022

List of Acronyms

Below is the list of acronyms that have been used throughout this thesis, listed in alphabetical order:

AD        Autonomous Driving
ADAS      Advanced Driver Assistance System
AEB       Automatic Emergency Braking
AI        Artificial Intelligence
ASIL      Automotive Safety Integrity Level
BDD       Berkeley Deep Drive
CAN       Controller Area Network
CEO       Chief Executive Officer
COCO      Common Objects in COntext
DAMA      Data Administration Management Association
E/E/PE    Electrical/Electronic/Programmable Electronic
FDT       Fault Detection Time
FPGA      Field-Programmable Gate Array
FRT       Fault Reaction Time
FTTI      Fault Tolerant Time Interval
FuSa      Functional Safety
GDPR      General Data Protection Regulation
GTSRD     German Traffic Sign Recognition Database
IEC       International Electrotechnical Commission
ISO       International Organization for Standardization
KPI       Key Performance Indicator
LIDAR     Laser Imaging Detection And Ranging
ML        Machine Learning
ODD       Operational Design Domain
PAS       Publicly Available Specification
QM        Quality Management
RE        Requirements Engineering
RQ        Research Question
SIL       Software In Loop
SOTIF     Safety Of The Intended Functionality
VEDLIoT   Very Efficient Deep Learning in Internet of Things
VIPER     VIsual PERception benchmark

Contents

List of Acronyms
List of Figures
List of Tables
1 Introduction
  1.1 Case Company
  1.2 Purpose of the Study
  1.3 Research Questions
  1.4 Scope and Limitations
2 Background and Related work
  2.1 Data quality and deriving requirements
  2.2 Safety
  2.3 Runtime Monitoring
3 Research Method
  3.1 Qualitative Research Method
  3.2 Preparation for Data Collection
    3.2.1 Sampling
  3.3 Data Collection
    3.3.1 Interviews
    3.3.2 Workshop
  3.4 Data Analysis
    3.4.1 Pre-Phase
    3.4.2 Tools used for Coding
    3.4.3 First Cycle Coding
    3.4.4 Second Cycle Coding
    3.4.5 Conclusion
  3.5 Validation
4 Results
  4.1 Challenges - Requirements for Data Selection
    4.1.1 Difficulty in handling the amount of data
    4.1.2 Finding the right variety of data
    4.1.3 Finding data with the right information content
    4.1.4 Clarity in defining requirements for data
    4.1.5 Applying safety requirements (e.g., from safety standards) for data
    4.1.6 Missing guidelines for data selection
    4.1.7 Unclear design domain / context definition
  4.2 Challenges - Runtime Monitoring
    4.2.1 Difference in understanding of runtime
    4.2.2 Being Time critical
    4.2.3 Keeping it lightweight
    4.2.4 No access to inner states of model
    4.2.5 Finding conditions that can be checked at runtime
    4.2.6 Trade off between safety and reliability
    4.2.7 Impact of Safety standards
    4.2.8 Defining metrics for runtime checks
5 Discussion
  5.1 Discussion and Main Findings
  5.2 Triangulation to Literature
  5.3 Implications for Research
  5.4 Implications for Practice
  5.5 Validity and Ethical Consideration
    5.5.1 Internal Validity
    5.5.2 Construct Validity
    5.5.3 External Validity
    5.5.4 Reliability
    5.5.5 Conclusion Validity
    5.5.6 Informed Consent
    5.5.7 Confidentiality and Anonymity
6 Conclusion
A Appendix 1
  A.1 Interview Guide
    A.1.1 Introduction
    A.1.2 Interview Questions
    A.1.3 Conclusion
B Appendix 2
  B.1 Workshop Material
C Appendix 3
  C.1 Coding Process
D Appendix 4
  D.1 Fishbone Diagram

List of Figures

2.1 Requirements Engineering Process, adopted from the literature, [Vogelsang and Borg, 2019]
3.1 Different stages in the Research method
3.2 Steps in Data Analysis
3.3 Atlas.ti Tool
4.1 Challenges in finding the right variety of data - Cause and Effect analysis
4.2 Challenges in applying safety requirements to data - Cause and Effect analysis
4.3 Unclear design domain - Cause and Effect analysis
4.4 Difficulty in keeping it lightweight - Cause and Effect analysis
4.5 Challenges in finding conditions that can be checked at runtime - Cause and Effect analysis
4.6 Unclear scope and impact of Safety Standards - Cause and Effect analysis
D.1 Fishbone Diagram RQ1
D.2 Fishbone Diagram RQ2

List of Tables

3.1 List of Interviewees
4.1 List of Challenges

1 Introduction

With the future of the automotive industry focusing on electrification and mobility, safety is of crucial importance in the development and implementation of critical applications. Safety has gained more attention in recent years, especially for automobiles, as ensuring the safety of the people in and around the vehicle is a top priority. Features like radar, LiDAR, camera-based systems, and image processing devices were introduced to make driving more comfortable, reliable, and safer [Belmonte et al., 2020].

A complex safety-critical system must be developed and validated using systematic methods to avoid systematic mistakes. For a system to be safe, it has to adhere to a list of processes and procedures. Different international standards and organizations are working to define a set of autonomous driving levels, functional safety levels, requirements, and characteristics for ADAS and AD systems [Litman, 2017, Smith and Svensson, 2015, Jiang et al., 2015, ISO, 2011].

Many ADAS applications rely significantly on Machine Learning (ML) and the data needed to perform a specific function [Kim et al., 2017]. Incorporating ML into the system requires a large amount of data from various sensors in the automobile. The information content will impact the dataset distribution and the Artificial Intelligence (AI) model’s effectiveness in generalizing from it. Information about the training dataset is a crucial part of developing an ML model that operates as intended.
Wrong information content and an insufficient amount of data lead to a bias in the dataset, which makes the model underperform when deployed in the field. It is therefore important that the data is of good quality and can be used as a reliable source of information for safe implementation. If data is not of sufficient quality, it can lead to hazards resulting in injuries or, in the worst case, fatal accidents [Sessions and Valtorta, 2006]. Data quality has two aspects. First, it can refer to specific quality attributes of the data, such as the resolution of image files or the compression rates of video streams. Second, it can refer to the information content of the data [Vogelsang and Borg, 2019]. Hence, we need a proper understanding of both aspects of the data to argue that the system is safe enough to be released in the field.

This thesis explores the challenges encountered when deriving requirements for finding the right information content of the data that ensures the desired behavior of the machine learning system.

Many ADAS applications use deep learning to fulfill the desired function. In this study, we focus on the Very Efficient Deep Learning in Internet of Things (VEDLIoT)¹ use case of Automatic Emergency Braking (AEB), a safety-critical function that places strict functional safety requirements on the system. AEB is a critical component of any ADAS. AI (in this use case an ML model, specifically a Deep Neural Network) is used to identify obstacles in camera images using image recognition. For AEB to work as intended, safety mechanisms must be implemented that ensure safe operation and, if necessary, a safe stop of the vehicle. However, a challenge is to determine which data should be used for training and operation of the deep neural network, such that the desired functionality (i.e., detecting dangerous obstacles on the road) is safely fulfilled in the given context.
Specifically, a challenge is creating requirements and maintaining them with proper traceability. One must have a clear understanding of the system and its behavior to define requirements. Defining requirements for the desired dataset according to the use case is a difficult task in machine learning. Breaking down requirements to the component level is also a challenge, because one must ensure that the system is safe from all possible faults that can occur. With sensor systems meant for measurement and image recognition, there will be several requirements originating from functional safety, for example technical requirements regarding the accuracy of the sensors or the need to have redundant mechanisms in place.

In the automotive industry, data quality is handled and decisions of safety-critical systems are made at different locations and levels of a large distributed system. A common, distributed, and decentralized paradigm is required to make the best use of local and global data models, as well as to determine how to distribute learning and reasoning across nodes to fulfill extreme latency requirements.

Runtime monitoring of a safety-critical system allows us to understand the data that the model receives. Such information allows us to understand the expected performance of the model. Developing such a high-performance model requires a requirement model that includes runtime monitoring aspects. When arguing for the safety of a product, we need to establish a connection between the requirements we have on the training data and the requirements we have on runtime monitoring. In order to monitor the incoming data, one needs to understand the requirements on the data to define monitoring goals at runtime. Doing so poses several challenges.

In this thesis, we perform a qualitative study by conducting interviews with industry experts in the field.
A qualitative study [Saldaña, 2021] enables us to investigate, explore and gain a deeper understanding of the subject. This study focuses on investigating the challenges faced while specifying the training data and runtime monitors for safety-critical applications.

¹ Very Efficient Deep Learning in the IoT, an EU Horizon 2020 project, see www.vedliot.eu

The thesis is organized as follows: In Chapter 1, we introduce the problem statement, the case company, the purpose of the study, the research questions, and the scope and limitations of the study. Chapter 2 presents the background and related work for this study. Then, Chapter 3 outlines the research methods for this study. This chapter furthermore explains the different stages of this qualitative study and the reasoning behind the choices made. Chapter 4 presents the results gathered and analyzed from the different interviews and a workshop. The results consist of the challenges and improvements identified in specifying data and runtime monitors for critical AI such as that found in ADAS. In Chapter 5, the results, their implications, and their validity are discussed. Thereafter, Chapter 6 concludes the thesis.

1.1 Case Company

This thesis is done in collaboration with the company Veoneer Sweden AB. Veoneer is a global Tier 1 supplier that designs and develops state-of-the-art systems and solutions for ADAS. Industry experts and supervisors at Veoneer have provided support and guidance throughout the project. Veoneer also provided the necessary information related to the system and dedicated resources to supporting us with interviews and other relevant workshops.

1.2 Purpose of the Study

The purpose of this study is to highlight the challenges experienced by practitioners when trying to specify data and runtime monitors for critical applications such as ADAS.
1.3 Research Questions

The study aims to address the problems discussed above by investigating the issues further through a qualitative study, in order to understand what the challenges are and possible ways to overcome them.

The first research question explores the current challenges faced in deriving requirements for data used in critical AI applications in the automotive industry:

RQ1: What are the challenges encountered in practice when deriving requirements for AI components, in particular concerning the selection of training data for safety-critical applications?

The second research question tries to establish a connection between runtime monitoring and training data:

RQ2: What is the role of runtime monitoring, in the aspect of data, in defining safety requirements and supporting safety argumentation?

1.4 Scope and Limitations

In this study, we do not construct a full prototype. Instead, we address cornerstones of what a solution might look like by investigating the current challenges in specifying data and runtime monitors for safety-critical AI.

In this thesis, we interviewed experts who work with research, development, and implementation of critical AI systems, especially in the automotive industry. The findings of the study should apply to autonomous driving in general and not be specific to one company, as we interviewed experts from different companies. For validation, we conducted a workshop with experts within the field. This would help generalize to other domains.

2 Background and Related work

In this chapter, we discuss the previous research and concepts related to the study. The chapter starts with Section 2.1, which presents the literature related to data quality and how to derive requirements. Then, Section 2.2 and Section 2.3 present literature regarding safety and runtime monitoring, respectively.
A previous study, which is based on VEDLIoT, sheds light on the challenges faced in deriving requirements for use cases of the system and in defining the Operational Design Domain (ODD) for it [Heyn et al., 2022]. The researchers suggested certain improvements based on their analysis. In this study, we focus on identifying the challenges faced in deriving requirements for training data and runtime monitoring.

For the background study, an online search for research papers was performed on data quality and deriving requirements, safety, and runtime monitoring for critical AI. The criteria employed to select the research papers were to find recent publications (with a few exceptions) and to refer to papers that have been cited by many other related papers.

2.1 Data quality and deriving requirements

Data quality can be defined as “the planning, implementation, and control of activities that apply quality management techniques to data, in order to assure it is fit for consumption and meet the needs of data consumers” [DAMA International, 2017]. Systems based on deep learning require and gather a large volume of data that needs to be managed and processed for effective decision-making. Such decisions can only be made if the input data is of good quality. Data quality should also consider quality attributes such as safety and robustness [Heyn et al., 2021].

Although data is probably the most important aspect of a machine learning application, there is no proper system to determine and manage the required quality and quantity of the data. [Heyn et al., 2021] mention that, after the introduction of more rigid data privacy rules such as the General Data Protection Regulation (GDPR), there is a growing pushback against the idea to “collect as much data as possible” for a machine learning application.
The authors further add that more data is collected in the hope that the right data might be among them. [Webb et al., 2001] also discuss this point in their paper.

[Beigelmacher and Lander, 2020] explain the importance of quality training data for a machine learning model. Data scientists who are experienced in fitting machine learning models prefer the data to be well structured, labeled with high quality, and ready to be analyzed. They also state that the purpose of training data is not restricted to training the model but extends to retraining the model throughout the AI development lifecycle. As real-world conditions evolve, the initial training data may become less accurate in its representation of ground truth. This requires us to update the training data and hence retrain the model. It is this training data that needs to be specified properly for the system to behave as intended.

They also highlight the important factors that affect training data quality, which are People, Process, and Tools. People include the actual workforce who gather and work with the data. People with different levels of experience and training have an impact on the selection of training data. Processes, which are basically communication protocols and business rules, also have an impact on training data. The tools that are used to label the data, the technology and platforms on which they are used, and how they are communicated to the workforce impact the training data as well.

According to [Heyn et al., 2021], data, and especially their representation in the form of probability distributions, are the core of machine learning. Different types of data (input data, training data, test data, etc.) play a role when deploying and using machine learning or deep learning. [Vogelsang and Borg, 2019] studied requirements engineering for machine learning-based applications in their paper and described challenges in machine learning systems.
Data requirements are one of the five challenges identified by the authors. Data requirements are divided into data quantity and data quality. The authors present a requirements engineering process as well. The process includes the following steps, also shown in Figure 2.1, which was derived both from [Vogelsang and Borg, 2019] and from several other publications related to Requirements Engineering activities.

• Elicitation
• Analysis
• Specification
• Validation

Figure 2.1: Requirements Engineering Process, adopted from the literature, [Vogelsang and Borg, 2019]

A system needs certain data requirements to be fulfilled for proper functioning.

Data requirements are requirements that data should adhere to in order to be effective in the operation of a system. An example of a data requirement could be that the data shall represent a given probability distribution for which the AI has been trained. Only then can a machine learning model arrive at the right decision [Heyn et al., 2021].

The data involved in an ML model is as important as the ML model itself [Breck et al., 2017]. According to [Unger et al., 2020], there is a huge demand for robust training data. Several tests have been done with existing datasets, such as Microsoft Common Objects in COntext (COCO), along with additional training datasets such as the German Traffic Sign Recognition Database (GTSRD), the VIsual PERception benchmark (VIPER), and Berkeley Deep Drive 100k (BDD). These were tested on a Convolutional Neural Network (CNN) to study the influence of training datasets in nighttime and low-visibility traffic scenarios. This resulted in an improvement to the training dataset selection.

Even the training data needs a set of requirements to be fulfilled in order to be considered qualified data. According to [Heinrich et al., 2018], in order to check whether the data taken is of good quality, we can use data quality metrics.
The authors demonstrate the applicability of these requirements by evaluating data quality metrics for different data quality dimensions.

[Vogelsang and Borg, 2019] explore and define the Requirements Engineering methodology for ML systems. They interviewed data scientists to better understand their perception of Requirements Engineering and the challenges involved in creating requirements for machine learning systems. In these interviews, the data scientists stressed that “training data needs specified and validated requirements like code”. The authors mention that an important activity of a requirements engineer is to identify and specify requirements regarding the collection of data, the data formats, and the ranges of data. This information needs to be elicited from the problem domain and serves as an input for data scientists. Requirements engineers must understand the importance of data provenance, i.e., critically question the data sources. They also argue that the development of ML systems demands that requirements engineers understand ML performance measures to state good functional requirements, be aware of new quality requirements such as explainability, freedom from discrimination, or specific legal requirements, and integrate ML specifics in the Requirements Engineering (RE) process.

[Gauerhof et al., 2020] state that in order to assure the safety of an ML model for pedestrian detection in crossing scenarios, explicit and concrete safety requirements are mandatory. Initially, requirements are created at the vehicle level. Following the V-Model, safety requirements are then created for the specific component, here an object detection component. These are categorized as Performance requirements and Robustness requirements. They also explain how requirements are created for the data management and model learning phases.
These are categorized as Relevant, Complete, Accurate, and Balanced requirements. Finally, they try to establish traceability between the system safety requirements and the ML safety requirements, hence providing sufficient arguments for proper safety assurance.

2.2 Safety

Compliance with end-user expectations is a central aspect of the design of machines, vehicles, and control systems. The increasing use of Programmable Electronic Systems has increased complexity and thereby made it harder to develop such systems in a safe and reliable way. According to [Han, 2007], safety is “freedom from unacceptable risk of physical injury or of damage to the health of people”.

Safety is not a property that can be added at the end of the design. Instead, it must be an integral part of the entire engineering process. To successfully engineer a safe system, a systematic safety analysis and a methodological approach to managing risks are required [Bahr, 2014]. Safety-critical systems and the distributed components in them have not been explored much. We hope to explore them further, gain some insight, and provide valuable suggestions.

Safety analysis comprises the identification of hazards, the development of approaches to eliminate hazards or mitigate their consequences, and the verification that the approaches are in place in the system. Risk assessment is used to determine how safe a system is, and to analyze alternatives to lower the risks in the system. An important thing to note when writing safety requirements is to ensure that they hold the right integrity level and that the safety mechanism to be implemented is independent of the normal function.

For software-intensive systems, the generic meta-standard IEC 61508 [Han, 2007] from the International Electrotechnical Commission introduces the fundamentals of functional safety for electrical/electronic/programmable electronic (E/E/PE) safety-related systems, that is, hazards caused by malfunctioning E/E/PE systems rather than non-functional considerations such as fire, radiation, and corrosion. Several different domains have their own adaptations of IEC 61508. ISO 26262 is the automotive derivative of IEC 61508, defined by the International Organization for Standardization. ISO 26262 is an established standard for Functional Safety (FuSa) of road vehicles. It is organized into ten parts, constituting a comprehensive safety standard covering all aspects of automotive development, production, and maintenance of safety-related systems.

The commonly used development processes in automotive follow the V-model in ISO 26262. The V-model assumes that all requirements are known and therefore complete, correct, and unambiguous, which is not true when using machine learning components. The methods highly recommended for “Software unit design and implementation” make it evident that the standard is highly code-oriented and that the implementation is tested against given requirements. However, it does not provide guidance for new technologies such as machine learning. The quality and safety assurance activities are performed during development time, which works well for well-defined functions. For a function built using machine learning, however, the output is highly dynamic.

[R. Salay and Czarnecki, 2017] present an adaptation of ISO 26262 for machine-learned components, and they consider the complexity of an end-to-end trained neural network to be too high, as the standard requires a division into small functional units. Since ISO 26262 is evidently no longer sufficient for the next generation of ADAS and AD systems, [Borg et al., ] discuss a complementary standard to ISO 26262 under development as ISO 21448 Safety of the Intended Functionality (SOTIF) (ISO, 2019).
SOTIF is a standard that aims for the absence of unreasonable risk due to hazards resulting from functional insufficiencies, including in systems that rely on ML. Standards such as SOTIF place high-level requirements on what a development organization must provide in a safety case for an ML-based system.

Although these automotive safety standards provide the necessary processes and guidelines for defining a safe system, the automotive industry still faces major challenges when defining safety requirements. We will address these challenges and provide improvement suggestions to overcome them.

Convolutional neural networks (CNNs) are becoming widely used computational methods in machine learning systems. As per [Torino et al., 2019], tests have been carried out to check the reliability of these neural networks in order to ensure safety in a vehicle involving ADAS. This is done by deliberately injecting faults into the model to see whether the results deviate from the expected ones. Following the ISO 26262 standard, tests were conducted on all levels of the system to boost confidence. We also try to employ similar safety standards in our research.

[Koopman and Wagner, 2016] analyzed the challenges for testing and validation of autonomous vehicles with regard to ISO 26262. The problem with machine learning systems is that there are no explicit requirements that can be tested according to the V-model in ISO 26262; instead, the requirements are implicitly encoded in the training data. To achieve a high level of safety, it is proposed to monitor the machine-learned components. This shifts most of the validation problem to the monitoring component.

2.3 Runtime Monitoring

Due to the dependency between the behavior of an ML system and the data it has been trained on, it is crucial to define actions that ensure that training data corresponds to real data.
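As an illustration of what such an action could look like, the following minimal sketch compares the distribution of one numeric feature in the training data with the same feature observed at runtime, using the two-sample Kolmogorov-Smirnov statistic. It is not taken from any of the cited works. The function names, the example feature, and the threshold of 0.3 are assumptions chosen purely for illustration; a real monitor would need a justified threshold and would cover many features.

```python
from bisect import bisect_right


def ks_statistic(train, runtime):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the largest gap between the empirical cumulative
    distribution functions of the two samples (0.0 = identical
    distributions, 1.0 = completely separated samples).
    """
    a, b = sorted(train), sorted(runtime)
    gap = 0.0
    # The largest gap between two step functions occurs at a sample point.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def has_drifted(train, runtime, threshold=0.3):
    """Flag drift when the KS statistic exceeds a threshold.

    The default threshold is an arbitrary, hypothetical value.
    """
    return ks_statistic(train, runtime) > threshold


# Hypothetical feature values, e.g. a mean image brightness per frame.
training_values = list(range(100))
similar_runtime = list(range(100))
shifted_runtime = list(range(100, 200))

print(has_drifted(training_values, similar_runtime))  # False
print(has_drifted(training_values, shifted_runtime))  # True
```

A check of this kind only signals that runtime data no longer resembles the training data; deciding what to do about it (for example, collecting new data or retraining) remains a requirements question.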
[Vogelsang and Borg, 2019] mention that performance on the training data can be specified as expected performance that can be checked immediately after the training process, whereas performance at runtime (i.e., during operations) can only be expressed as desired performance that can only be assessed during operations. Since data characteristics may in reality change over time, requirements validation becomes an activity that needs to be performed continuously during system operation.

In the interviews conducted by [Vogelsang and Borg, 2019], the interviewees (data scientists) agreed that monitoring and analyzing runtime data is essential for maintaining the performance of the ML system. They also agreed that ML systems need to be retrained regularly to adjust to recent data. This helps keep the system free from errors and faults, which can range from systematic faults to random hardware faults. All of these can be reduced when the runtime data is taken into account for the ML model. By analyzing the problem domain, a requirements engineer should specify when and how often retraining is necessary.

[Schratter et al., 2018] explore a methodology in which accident data is used to develop a braking strategy for AEB (Automatic Emergency Braking). They use machine learning to predict driving scenarios and trajectories, which are fed as training data to the ML model. Driver monitoring is also used to capture the drivers’ behavior during critical situations. Furthermore, they state that a model is only an approximation of the real world and has, by definition, certain deviations from real-world scenarios. We can learn a lot about this incorrect behavior from the ML algorithm. However, real data must still be used to obtain realistic behavior of the learned algorithm. Based on their analysis, it is evident that a machine-learned safety-critical system can only be developed based on real-world data.

[A. Kane and Koopman, 2015] performed black box monitoring without using neural networks. They developed a real-time monitoring system for an autonomous research vehicle that passively observes the Controller Area Network (CAN) bus. They reported that vehicles are equipped with commercial off-the-shelf components, which cannot be instrumented for runtime monitoring and therefore have to be treated as a black box.

[Watanabe et al., 2018] applied runtime monitoring to detect when the system transitions into an unsafe state or when it violates a critical safety requirement. They discuss the use of runtime data to better design intelligent systems, which helps to support the initial design and the data provided. In our study, we also analyze whether runtime data can be used to retrain the machine learning model.

Analyzing systems during runtime is as difficult as writing requirements for them. One has to constantly gather field data in order to have all the information needed to create such requirements. This is a challenge that is seldom analyzed by industry experts. The relationship between monitoring the machine learning component during runtime and requirements on the training data is missing in the existing literature and is covered in this thesis. We found little literature that explores the support of safety standards (such as advanced safety standards for AD) for both training data and runtime monitors in safety-critical systems. Previous studies do not connect requirements that describe the design domain and data specifications to requirements on the runtime monitors. Few studies cover how metrics could help in defining requirements on runtime monitors and training data. In our study, we explore these gaps by investigating the challenges in specifying both training data and runtime monitors for critical AI systems.
3 Research Method

In this chapter, we discuss the research methodology applied in the study. We performed a qualitative exploratory study. This chapter starts with an outline of the qualitative research method in Section 3.1, which is followed by a more detailed explanation of the different steps.

3.1 Qualitative Research Method

A qualitative study is the process of understanding and exploring a problem by collecting data from individuals based on their direct experiences and analyzing the data inductively, with the aim of creating themes by interpreting the findings. According to [Creswell and Creswell, 2017], a qualitative study is useful when the researcher tries to investigate and explore a problem without knowing the variables of the problem.

A qualitative study tends to focus on the different perspectives of different people by incorporating the real-world context and their experiences. We chose the qualitative research method for this study because of the need for further exploration and a deeper understanding of the topic, namely specifying data and runtime monitors for critical AI applications. The study is set in a realistic environment. It also attempts to broaden our understanding of the participants’ experience within the subject area and of their views and opinions, and it explores issues in the subject area that have not yet been identified.

As stated by [Creswell and Creswell, 2017], a qualitative research process is emergent, which means that the stages are not strictly followed in sequence. Some of the stages might need to be revisited or altered once the researchers start collecting the data. The process of our research study involves six stages, which are also shown in Figure 3.1:
• Planning,
• Preparation for data collection,
• Data collection,
• Data analysis,
• Evaluation,
• Reporting study results.

Figure 3.1: Different stages in the Research method

An initial literature review was performed to review existing research related to the topic of the study. It served as preparation for the data collection by deepening the knowledge and broadening the understanding of the topic. The data collection stage includes collecting and recording information from the participants through semi-structured interviews. The collected data were then analyzed inductively using thematic analysis. This was done to develop themes, and these themes were interconnected to form a qualitative model. The validation phase investigates the correctness and credibility of the findings. After the findings were validated, the results of the study are reported.

3.2 Preparation for Data Collection

This section outlines the preparation for data collection for this qualitative study. The subsections describe the process and material created for the interviews, the initial literature review, and the justification of the samples used in this study.

The main documents created for the interviews were the consent form and the interview guide. The consent form assured the participants of confidentiality and informed them about how data would be handled and stored. It was sent out before each interview and signed by the interviewee before the interview. The interview guide contained the script of the interview; more specifically, it consisted of important information that should be given to the interviewee and the questions addressed during the interview. The interview guide was only available to the interviewers and was not sent to the interviewees beforehand. However, the interview guide was sent to one participant at his request, and he responded by e-mail. An actual interview was also conducted with this participant on the same day.

3.2.1 Sampling

Participants for this study were purposefully chosen using the maximum variation strategy. Purposeful sampling is a common sampling technique in qualitative research.
This sampling strategy was chosen to properly investigate and explore the topic by interviewing experts in autonomous driving with different experiences and opinions on the topic.

We chose the maximum variation strategy because, according to [Creswell and Creswell, 2017], it is a good strategy to increase the chances of collecting data that reflect different perspectives. The participants were chosen based on their role, company, and availability to participate. To obtain samples that cover the different parts of the process, four different kinds of roles were identified and contacted:
• Experts in the field,
• Requirements Engineers,
• People on the customer/user side, for example the product owner and function owner,
• Researchers.

The case company was responsible for finding and contacting interviewees for this study, based on the sampling strategy provided by the researchers. The academic supervisors also took part in finding relevant participants for the study. The different kinds of roles requested were communicated to the case company, alongside some additional requests such as experience in working with safety-critical applications. At least ten interviewees were requested for the study in order to obtain a sufficient amount of data to explore the topic.

[Marshall and Rossman, 2014] state that a researcher needs to be flexible when it comes to sampling because it can change during the study. Since some participants were only proposed during the data collection phase, adjustments to the sampling strategy were also made during that phase to cover the spectrum of participants that had been requested. The final list of participants, both from inside and outside the case company, is shown in Table 3.1.

3.3 Data Collection

This section addresses the procedures used for data collection and the methods employed in the qualitative study of the project.
The next subsections detail the methods used for the interviews, how the interviews were conducted, and how the data were transcribed and organized for the next stage of the study.

The data collection encompasses five steps [Creswell and Creswell, 2017]:
• The first step is to identify how to select the participants for the study, where we decide whom we need to select for the study.
• In the second step, we obtain access from the case company and the permissions needed for interviewing people.
• The third step is to consider what type of data we should collect and the different options for collecting information. The methods need to be selected based on the type of data that is needed to answer the research questions at hand.
• The fourth step is to locate, select, and assess the necessary instruments, such as interview protocols and processes for gathering, recording, and storing confidential data.
• The fifth step is to describe the procedures for administering the data collection process when collecting the data from interviewees.

Interviewee Code | Interviewee Designation | Field of Work
Interviewee 1 | Postdoctoral Researcher | Functional Safety and Machine Learning
Interviewee 2 | Research Specialist | Sensors and Systems
Interviewee 3 | Principal Engineer | Functional Safety
Interviewee 4 | Research Specialist | Artificial Intelligence and Software
Interviewee 5 | Function Owner | ADAS
Interviewee 6 | Simulation Engineer | Real time simulation, SIL (Software In Loop)
Interviewee 7 | CEO | Deep learning
Interviewee 8 | Safety Expert | Global Safety Organization (Case company)
Interviewee 9 | Coordinator of VEDLIoT Project and Research Specialist | FPGA (Field-Programmable Gate Array) Computing and Software side of AI
Interviewee 10 | Global Head, Functional Safety | Functional Safety Confirmation Measures

Table 3.1: List of Interviewees

Steps one to four are described in the previous section already.
There are four basic types of data collection procedures in qualitative research [Creswell and Creswell, 2017]:
• Qualitative Observations
• Qualitative Interviews and Workshop
• Qualitative Documents
• Qualitative Audiovisual materials

In this study, we employed interviews and a workshop as the methods for collecting data from the participants and interviewees. With these methods, we had complete control over the questions asked and the type of data collected. They also allowed us to steer the questions more in line with the research questions as the data we gathered deepened our understanding.

3.3.1 Interviews

The interviews conducted in our thesis were semi-structured and served to gather information and data for the thesis. They were conducted remotely with technical experts. Because the interviews were semi-structured, the order of questions was altered during the conversation with the participants. The interviewer also included additional questions that allowed for a deeper understanding and exploration of the topic under study. The main objective of the interviews was to understand the current process and the challenges experienced in specifying training data for critical AI systems, such as ADAS, while relating them to runtime monitoring.

The participants for the interviews were selected within the automotive domain, both from the case company (from different sites) and from outside the case company. Years of experience in their respective field of work was one selection criterion. We selected experts with 5 to 25 years of experience. They provided us with answers and inputs that were relevant and, owing to their work experience and expertise, also reliable. We conducted ten interviews in total, one of which had two interviewees. The interviews were conducted remotely using either Microsoft Teams¹ or Zoom². Each interview session took about one hour.

This study has been carried out by both authors of this thesis.
We took turns asking questions during the interviews, with one person acting as the interviewer and the other observing. At the start of each interview, we presented some background details, gave an outline of the study’s objective and goal, and obtained consent for recording the interview. The interviewees were also informed that any information or data they provided would be used only for the study.

The questions in the interview guide were formulated mainly based on the research questions, with the sole intent of finding answers to these questions. The interview guide was divided into four sections, each with its own set of questions.
• The first section consisted of questions aimed at learning about the participants’ current roles and experiences.
• The second section focused on establishing the concepts with the participants, to prevent any misunderstanding about the study and to make clear what data or information we were looking for. In this section, we gathered information on training data and on how requirements for the training data are selected and created. For some questions, examples were given to assist participants in answering the question.
• The third section explored the incorporation of safety into an ML model. It aimed at understanding how the ML model affects a safety-critical system and which processes and standards need to be adhered to.
• The fourth section tried to bridge the relation between training data and runtime monitoring.

¹ An online communication and meeting tool, https://www.microsoft.com/en-ww/microsoft-teams/group-chat-software
² Another online communication and meeting tool, https://zoom.us/

During the interviews, the order of the questions was altered slightly, and in some interviews additional follow-up questions were included as well.
At the end of each interview, the participants were also informed that there might be follow-up questions, which would be sent via email. A few of the participants were also invited to a follow-up workshop.

All ten interviews were recorded, and field notes were taken when needed. Each recorded interview was then transcribed in two steps. First, the recorded video was automatically transcribed into a Word document using the Microsoft Teams transcription service. Timestamps and unclear words were specifically marked in the transcripts. Afterward, we listened to all the interviews ourselves and cross-checked the transcripts to ensure that there were no transcription errors. Punctuation and proper spacing between sentences were added for clarity. The transcripts were anonymized and were created to be used only for analysis and for the purpose of this study.

The transcripts of the initial interviews made us familiar with the data and also helped in refining the interview questions for the subsequent participants to increase the quality of the data. Apart from slight modifications, there were no major changes to the questions.

3.3.2 Workshop

The workshop was our other method for collecting data. The participants in the workshop included researchers from the case company, fellow researchers from VEDLIoT, participants from the interviews, and the supervisors of the thesis. The idea behind the workshop was to gather all participants, discuss potential problems encountered during the analysis, and collect feedback and suggestions for the research. We also aimed to validate our themes in the same workshop rather than in a separate session. The workshop was conducted on May 25, 2022, in a one-hour session. A total of 16 participants attended the session. We also brainstormed to see whether there were any further themes apart from the predefined ones.
Since the workshop was conducted remotely, we used Mentimeter to make it more interactive and presented all the themes to the participants there. All the participants were given access to the presentation. Initially, the researchers briefly explained the overview of the study and the purpose of the workshop, with emphasis on the research questions that need to be answered. All the participants were also asked to rate their experience in different fields of engineering, ranging from Requirements Engineering to AI system development. They voted from No experience to Very senior (10+ years) experience in the different fields. The results were presented, and the participants were asked to provide comments and suggestions on the collected results. We also asked them to rate the provided themes on a scale from strongly agree to strongly disagree. In the end, we asked for their opinions on the most relevant and important questions related to RQ1 and RQ2.

3.4 Data Analysis

In this section, we describe how the gathered data were analyzed and organized into meaningful categories. Coding practices and guidelines helped us categorize the data into themes [Saldaña, 2021]. This process included four stages, as shown in Figure 3.2:
• Pre-Phase,
• First Cycle Coding,
• Second Cycle Coding,
• Conclusion.

Figure 3.2: Steps in Data Analysis

3.4.1 Pre-Phase

During the Pre-Phase, we performed an initial pilot coding, which helped us understand which coding methods would help us in our study. There were several coding methods to choose from. We started with generic coding methods, such as Attribute Coding and Descriptive Coding, but were open to changes if these did not generate substantive discoveries. We then investigated which of these coding methods would help us further in our research. As mentioned by [Saldaña, 2021], we first tested the following coding methods:
• Attribute Coding,
• Structural Coding,
• Descriptive Coding,
• In Vivo Coding.

All the collected transcripts were divided equally between the two researchers, who performed the coding independently of each other. This pre-phase stage of the analysis was done in Microsoft Excel. Once it was done, the researchers came together to compare their codes. After thorough analysis and evaluation, the researchers discussed and decided on the methods to adopt.

3.4.2 Tools used for Coding

Having used Microsoft Excel for the pre-phase analysis, we found it very difficult and tedious to perform the complete analysis manually. Hence, we adopted a tool called Atlas.ti, which helped us considerably in coding. We used a licensed version of Atlas.ti that the university provided based on our needs. The tool had plenty of features that made our work easier. It would have been difficult if we had stuck to the Excel way of working.

The tool can import and export various data formats, ranging from audio and video to transcripts in Word or Excel format. Transcripts can either be created in the tool from the audio/video files or be imported if they already exist. Once coding has started, codes can be merged, categorized into themes, or replaced with new ones. The tool can create word clouds and word lists for exploring the content more deeply, and it has easy search and retrieve options. After the data have been categorized, charts and diagrams can be created to visualize the relations between the categories. We also faced certain issues with licenses, which were sorted out with support from the university. An example of the tool’s graphical user interface is shown in Figure 3.3.

3.4.3 First Cycle Coding

During the first cycle, we employed the four coding methods mentioned above.
Attribute Coding was used at the beginning of a data set to collect information on participants’ characteristics and demographics, such as their role, their experience, and the time frame of the interviews conducted.

Structural Coding helped us collect content or phrases that represented a topic of inquiry for a segment of data, especially pointing to the research question that was used to frame the interview [Saldaña, 2021].

Descriptive Coding helped us assign basic labels to data to provide more meaning to the topics gathered. Here, we summarized the topics in a word or phrase, such as a noun.

The In Vivo Coding method, also known as literal coding or verbatim coding, helped us categorize data based on the actual phrases used by the participants, in their own vocabulary and language. This method helped us enhance and deepen our understanding of the research questions at hand.

Figure 3.3: Atlas.ti Tool

3.4.4 Second Cycle Coding

Before moving on to the Second Cycle Coding methods, we performed a mapping that gathered the initial themes with their related codes. Second Cycle Coding methods are advanced methods of reorganizing and reanalyzing data. Our goal here was to develop a sense of the different categories as themes created from the first cycle of coding. An important point to note is that the number of codes should decrease, not increase, with each cycle of coding.

Here, we adopted two coding methods, namely Pattern coding and Focused coding. Pattern coding helped us develop the "meta-code", that is, to group similarly coded data into a smaller number of sets or themes in order to identify an emergent theme, configuration, or explanation. By identifying similarities, we not only organized our content but also gave meaning to the research questions. Focused coding helped us find the most frequent or similar codes in order to develop the most important categories, without paying attention to their properties and dimensions.
This coding also follows the In Vivo Coding method from the first cycle of coding.

3.4.5 Conclusion

After completing the coding, we ended up with a code list. We then derived themes for the codes that were clear and understandable based on the generated codes. The resulting initial themes helped us create a first draft of a fish-bone diagram. A fish-bone diagram is a cause-effect diagram, and our codes represent causes that influence or contribute to the identified theme. The assignment of codes to themes was then reviewed together with our supervisors. Based on their feedback, we revisited some of the causes and updated the fish-bone diagram with better names for the causes. The fish-bone diagrams that we arrived at were presented to the participants during the workshop. This enabled us to revisit some of the themes again based on the participants’ thoughts and views on them. We also had another session with our supervisors after the workshop, in which we finalized our fish-bone diagrams. The final diagrams are given in Appendix D.1 and D.2.

Atlas.ti proved very useful for efficient coding analysis and categorization. Choosing it over Excel was a wise decision.

3.5 Validation

After the second cycle of coding was done, we gathered all the categorized themes and reviewed them together with industry experts in a workshop. The participants in this workshop included researchers from the case company, fellow researchers from VEDLIoT, participants from the interviews, and the supervisors of the thesis. As mentioned above, we also performed validation during the same workshop. This helped us validate the findings from the different interviews conducted. Mentimeter was used to gather the needed information from the participants. All 16 participants were involved in the validation phase.

There are various strategies for validating our analysis [Creswell, 2007].
They include:
• Member checking
• Triangulating sources of data
• Using a Peer or External auditor

We used member checking to validate our themes together with our supervisors and research specialists. In doing so, we assessed the accuracy of the findings of our qualitative analysis. To better quantify the level of agreement among the participants, we used the Likert scale and asked the participants to rate the relevance of the themes to our research from Strongly agree to Strongly disagree. This was requested for both research questions. Later, the participants were asked about further challenges they had encountered or knew of that were not covered in our interviews. Most of them agreed with the themes already presented, but some additional responses were also received and discussed. The additional data were analyzed and added to our existing cause and effect diagrams after discussion with our supervisors. They were later included in our study results.

4 Results

In this chapter, the results for the research questions are presented in detail. The chapter begins with Section 4.1, which elaborates on the results for the challenges in deriving requirements for the training data of AI components. It is followed by the results for the challenges in defining requirements for runtime monitors in Section 4.2. The themes identified through thematic analysis for each research question are presented in the respective sections. Furthermore, the responses from the workshop are also included.

4.1 Challenges - Requirements for Data Selection

This section presents the results regarding the challenges encountered when deriving requirements for AI components concerning the selection of training data (RQ 1). The interviewees were asked to share their thoughts and experiences about the training data that they use in their work. They were also asked in detail about the process of selecting training data specifically for a safety-critical system.
Then, they were asked to share their opinions and thoughts regarding the procedures and challenges in deriving requirements on the training data while incorporating the safety standards and procedures. The following sub-sections present the identified challenges in detail.

4.1.1 Difficulty in handling the amount of data

Regarding the planning of data collection, a researcher mentioned that it is important to know which types of data are interesting for training in order to make the best use of the experiments. More training data eventually leads to longer training time and thus to higher costs. It is essential to plan how to reduce the amount of training data to avoid the problem of cost and time. This was supported by another interviewee, who said that it is not only expensive to pay for the model car, the equipment, and the salary of the driver who collects data, but that managing all the collected data is also a logistical problem. Each kilometer driven results in a considerable amount of information, which also requires high-capacity network infrastructure to pass data from a vehicle to the data center.

"...Uh, and the data rates are so large that we cannot use ordinary Internet infrastructure to pass data from a vehicle in the fleet to the data center because the one Gigabit connection doesn’t hold up at all." - Interviewee 1

A few of the interviewees preferred to have more training data and were willing to accept longer training time. One of the interviewees also mentioned that it is a trade-off between large data, time, and cost. Another interviewee mentioned that it is a challenge to obtain the right information with exactly the right amount of data and the right amount of variety. He described this challenge as follows:

"it is a two-sided optimization" - Interviewee 1

4.1.2 Finding the right variety of data

When asked how they find the right variety of data in industry and research, a research specialist gave his opinion for both the industry and the research context.
He mentioned there is a slight difference in what they are looking at in research and production. The researchers focus on edge cases whereas, in production, they look at the entire spectrum of events in traffic. The edge cases in this scenario mean some events which are very interesting to analyze and which are less frequent than others in the distribution of traffic events. He further added that they try to avoid normal driving events or behavior. And additionally, they try to find other solutions besides modifying the AI model which can mitigate these complicated edge cases. They try to reduce the amount of data that the production team can use to build the product. Another interviewee added that when considering the variety of the data it is im- portant to have variations in different scenarios or use cases. For example, data with people running on the road, different people, different weather conditions, and different types of the road but within the same scenario on the street. "The variety of the data but also difference in like say scenarios or use cases, let’s say you have people running on the road, you wanna make sure that you have data with people running on the road. Given the scenarios, that should be enough variation: Different people, different weather conditions, different types of roads still within the same scenario that people are running on the street." - Interviewee 7 Interviewee 5 expressed that the quality of the data is affected by the variety of the data set. The interviewee also mentioned that it is important to consider this aspect 24 4. Results Figure 4.1: Challenges in finding the right variety of data - Cause and Effect analysis during the data collection. "And yeah, and you said by variety, it could be like a black cat with long fur, a black cat with. Shorter fur is this, things like that. Maybe if you take, for example, the roads, they can be different kinds of post boxes. 
And then there can be different kinds of a lamp posts or something like that. So does the variety of data affects the quality there’s a variety of data that makes an impact. Yes, but I also think that you should consider it. Uh, when you need to have that variety of data." - Interviewee 5

When asked about measures for the variety of data, Interviewee 1 mentioned that it is difficult to arrive at a suitable metric. It seems that such a metric must be negotiated between the stakeholders. He added that statistical approaches could be a good option, for example entropy as a measure, since it indicates how different the data is when a new set of data is encountered: when the new data adds a lot of variety, the entropy changes. It was again stressed that it is extremely difficult to have a measure to check the variety of data.

"...You mean like a measure of variety? Yeah, that is extremely difficult. I don’t really have an answer." - Interviewee 1

The cause and effect analysis for finding the right variety of data is shown as a fishbone diagram in Figure 4.1. It was mentioned that it is also expensive to obtain variety in the data since resources are often limited. One of the interviewees stated that there are correlations in the data that make one feature dependent on another, which is very important and should not be neglected.

4.1.3 Finding data with the right information content

Furthermore, we asked the interviewees about the process of determining the kind of information the training dataset should contain during the data selection phase, for example that the data should contain a certain number of hours in darkness. One of the interviewees stated that statistical measures could be used to avoid bias in the machine learning model and further added that this is the second part of the quality aspect of the dataset.
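
To illustrate what such a statistical measure could look like, the following minimal sketch computes the Shannon entropy that Interviewee 1 suggested in Section 4.1.2 as a possible indicator of variety. It is only an illustration under stated assumptions: the function name `shannon_entropy` and the scenario tags (`day_dry`, `night_rain`, and so on) are hypothetical and do not come from the interviews, and the sketch assumes that every recording already carries one categorical scenario tag.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Shannon entropy (in bits) of the empirical distribution of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # H = -sum(p * log2(p)) over all observed categories
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical scenario tags for an existing data set and a newly collected batch.
existing = ["day_dry"] * 90 + ["night_dry"] * 10
new_batch = ["night_rain"] * 20 + ["day_snow"] * 20

entropy_before = shannon_entropy(existing)
entropy_after = shannon_entropy(existing + new_batch)

# A higher value after adding the batch indicates that the batch
# made the distribution of scenario tags more varied.
print(f"before: {entropy_before:.3f} bits, after: {entropy_after:.3f} bits")
```

Such a number only reflects the variety of the chosen tags, not of the raw sensor data, so it does not resolve the difficulty the interviewees described. Which tags to use and which entropy level counts as sufficient would still have to be negotiated between the stakeholders.
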
It was mentioned that the biggest challenge is not obtaining enough data as such, but obtaining the right data, with the right variety and the right amount. The interviewee described this as "a two-sided optimization".

"It’s not necessarily a challenge to obtain enough data, but the right data means having exactly the right amount of variety and the right amount of data. So I think it’s a two-sided optimization." - Interviewee 1

"Data overload" was mentioned as a challenge, as it is important to filter the data and find the interesting parts to add to the data lake. The interviewee further added that the challenging task is finding "the interesting data".

"One challenge is data overload that you have to be filtering and find what you’re interested to log in and dump it into a data lake. But then how do you find the interesting part?" - Interviewee 6

4.1.4 Clarity in defining requirements for data

Five of the ten interviewees expressed concerns about the clarity needed when defining requirements for data. One of the common issues, as stated by a function owner, is a lack of proper communication with the customers. When a question is asked, the answer covers only that question and nothing more; no additional information is provided. This can lead to misunderstanding and confusion concerning requirements. As an example, if the question is to identify a black cat, then what is the difference between a gray cat and a black cat? Furthermore, if the question is whether it is a male or a female cat, a different response is given again. This is a standard case of misunderstanding or miscommunication in requirements elicitation. It can be an issue in ML as well, since the requirement under analysis needs to be understood precisely. For example, the sensor images or camera images that are used to detect the different objects must be clearly stated in the requirements.
Hence, it is stated that the requirements are to be properly understood and drafted.

"...you get the answer based on your question. So if you ask for a cat then you get one answer. But if you ask for a black cat, you get another answer" - Interviewee 5

Another interviewee suggested that the challenge starts when defining the operational environment of the machine learning model. If the environment in which the vehicle is to be deployed is unknown, it will be difficult to draft proper requirements. There might be some information and assumptions about what the operating environment will be. In reality, there is always a chance that the environment changes due to unforeseen circumstances, but the basic conditions remain. This is important to consider in order to write a proper requirement.

"So because that is not really solved how to clearly specify the operational environment of your machine learning model, it is difficult to clearly write a data specification" - Interviewee 1

One of the interviewees stated that not all requirements are hard. Some are simple and easy to understand, while others are quite tricky. It is important to consider the scenario parameters that the actual production site uses. This also applies to the different weather conditions and markets in which the vehicle will be deployed. It was added that these data need to be standardized, as it will otherwise be hard to keep track of different parameters for different vehicle variants.

"When we look at the data that at the production site actually uses. I mean they have defined different scenario parameters. There is some standardization being going on in this area to try to define all of these" - Interviewee 2

Two of the interviewees mentioned that it is crucial to use representative data when creating requirements. When working with customers, the parameters and conditions to be considered are already provided and clearly stated.
Examples included the amount of traffic on highways and in suburban areas, the time when darkness falls, and rain, wind, and snow conditions; this applies to all vehicle markets. In addition, it was mentioned that suppliers ask customers to provide some preliminary analysis of data from their test vehicles and simulations. This helps to draft realistic requirements based on the data provided.

"you do some analysis... You’d find some weaknesses or spots where the AI has essentially too little information and from that, you will usually produce a small list of reasonably or realistic requirements" - Interviewee 9

It was emphasized by one of the interviewees that the specification designed on paper needs to be in sync with what is seen in practice. There might be several variations from theory to practice. Hence, it is important to keep the two synchronized in order to draft requirements properly.

4.1.5 Applying safety requirements (e.g., from safety standards) for data

When dealing with safety-critical systems, it is also important to follow the safety standards. The standards also provide strict guidelines on writing requirements. One of the interviewees stated that it is important to implement the requirements exactly as they were stated. Safety requirements are not to be treated lightly, as developers are required to implement them according to the standards. With larger control flow monitors and implementations spread across different nodes, it can become tricky to see whether the system works as intended. As supported by another interviewee, practitioners need to follow the safety standards as closely as possible. This is done to protect the system from two types of faults: Systematic faults and random hardware faults.
Systematic faults can be software bugs introduced during implementation, whereas random hardware faults could be a bit flip or an issue with hardware components. Hence, adherence to the standards is enforced when deriving requirements and implementing them.

The standards that companies in the automotive sector need to adhere to, as stated by one of the interviewees, are ISO 26262 and ISO 21448 (SOTIF). However, according to the interviewee, these standards do not support AI, ML, or neural networks; they only support conventional electrical/electronic systems. Hence, new standards such as ISO/TS 5083 (Technical Specification) and ISO/PAS 8800 (Publicly Available Specification) are called for. These standards are complementary to the existing ISO 26262 and SOTIF standards, which means the foundation is similar.

"...they want to complement ISO 26262 and ISO TS 21448 for SOTIF. But when you increase automation, obviously we have to learn on the way..." - Interviewee 10

Multiple interviewees stressed the limitation that ISO 26262 does not work with probabilistic models. Figure 4.2 shows a fishbone diagram of the challenges in applying safety requirements to data.

Figure 4.2: Challenges in applying safety requirements to data - Cause and Effect analysis

4.1.6 Missing guidelines for data selection

When asked about guidelines for proper data selection, one of the interviewees stated that they do not follow any specific guidelines for data selection. They employ regular processes and tools, such as TensorFlow or PyTorch, for training the algorithms. However, another interviewee took the opposite view, stating that there have to be strict guidelines for data selection. Data is to be selected based on the target environment in which the model is to be deployed. The interviewee also stated that the data used at the production site might not be the same as the data used at an earlier stage.

"...when we look at the data that the production site actually uses. I mean they have defined different Scenario parameters" - Interviewee 2

Hence, it is necessary to use the data from the target environment even in the initial stages. Another interviewee stressed that completeness criteria are missing: the data selection process should address the completeness of the information, such as how much of the data should be recorded in darkness, in rain, in snowy conditions, etc.

It was also mentioned that annotated data received from the customer should also be subject to guidelines before it is used to create or optimize systems.

4.1.7 Unclear design domain / context definition

Some of the interviewees mentioned that the training data needs to contain different variations, for example people of different colors, different traffic signs, different symbols, and different types of roads, because these variations can appear in different places and therefore need to be included. If such variations are not identified, the interviewee considered it an issue that needs to be covered in the design. All of these are part of the ODD, which is important to note.

As stated above, one of the interviewees emphasized the challenge of defining the right operational domain for creating proper requirements. This is also a challenge if the scope of the design domain is not clearly defined. Several conditions come into play when defining the design domain. The examples given were traffic sign types, the symbols and arrows used, types of road and vehicles, passengers on the street, etc. A cause and effect analysis is shown in Figure 4.3.

Figure 4.3: Unclear design domain - Cause and Effect analysis

When asked about the design domain, an interviewee mentioned that communication with the customers plays an important role in defining the ODD, since the customers provide the ODD, which is then translated into requirements.
Communication deficiency is a challenge when defining the context.

"...people who are walking, cycling, other vehicles, motorcycles, bikes on the road, all road users and some others. Like I said weather conditions are also one thing. So these are all part of ODD" - Interviewee 10

All these constitute the design domain for an AI or ML system. Suppliers just use the conditions defined by the customer and nothing more.

4.2 Challenges - Runtime Monitoring

To investigate the role of runtime monitoring in defining safety requirements, it is essential to understand the interviewees’ opinions on runtime monitoring and its importance. They were asked to share their views on the status of runtime monitors for AI systems in their respective work. They were also asked to share their thoughts and experience on the possible runtime checks and the challenges in setting requirements for them. This section presents their responses in detail.

Based on the interviews, eight different challenges were identified: ‘Difference in understanding of runtime monitors’, ‘Being Time Critical’, ‘Keeping it lightweight’, ‘Lack of access to inner states of model’, ‘Difficulty in finding Conditions that can be checked at runtime’, ‘Impact of Safety standards’, ‘Defining metrics for runtime checks’, ‘Trade off between safety and reliability’. The following sub-sections present each challenge in detail.

4.2.1 Difference in understanding of runtime

The interviewees understood runtime monitoring in different ways, but there were similarities as well. One of the interviewees stated that runtime monitoring takes the noise of the sensors into account. Hence, the boundary within which the system needs to operate includes a margin for noise as well. What is important to check is whether the assumed noise characteristics are indeed correct. Runtime monitoring helps us to check this data.
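
A check of this kind could be sketched as follows. This is a minimal illustration and not a method described by the interviewees: the function name `noise_within_assumption`, the `tolerance` factor, and the idea of computing residuals against a reference signal are all assumptions made for the example.

```python
import statistics


def noise_within_assumption(residuals, assumed_std, tolerance=1.5):
    """Check whether observed sensor noise is consistent with the assumed level.

    residuals:   differences between sensor readings and a reference value
                 (how the reference is obtained is outside this sketch)
    assumed_std: noise standard deviation assumed at design time
    tolerance:   hypothetical margin factor before the monitor raises a flag
    """
    if len(residuals) < 2:
        raise ValueError("need at least two residuals to estimate noise")
    observed_std = statistics.stdev(residuals)
    return observed_std <= tolerance * assumed_std


# Example with made-up numbers: the first window matches the assumption,
# the second is far noisier than the design assumed.
print(noise_within_assumption([0.1, -0.1, 0.05, -0.05], assumed_std=0.1))
print(noise_within_assumption([1.0, -1.0, 0.8, -0.9], assumed_std=0.1))
```

The sketch only compares a sample standard deviation against a fixed bound. A real monitor would also have to define the window length, the reaction when the check fails, and how the bound is updated as sensors age.
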
One of the interviewees mentioned that runtime monitoring is used to identify weak areas and critical use cases. Since the system is designed within its boundaries and parameters, there might be scenarios that the system has never encountered. Hence, it is important to find these critical test cases and train the model accordingly.

"So I would I would find some like critical use cases or critical test cases and try to align the real-time data with the machine learning" - Interviewee 5

Interviewee 9, working with VEDLIoT, mentioned that runtime monitoring is mainly used to gather more training data. This partly aligns with the opinions of others, in that it is interesting to collect only the exceptional events and not the normal day-to-day ones. Things that happen all the time are already modeled in the system, and the ones that matter are those that are not in the system yet. Another interviewee said that runtime monitoring is used to avoid bias in the training data and to obtain the ground truth, for which training data cannot be relied upon.

4.2.2 Being Time critical

Several interviewees stressed the importance of runtime monitoring and the elements that must be considered for it to work properly. One of the interviewees stated that timing is critical when it comes to runtime monitoring in safety-critical systems. Many of the applications that run ML models are time critical. This is also part of the structure when defining requirements for runtime monitoring. It was also mentioned that certain safety standards, namely ISO 26262, are adopted to design the systems. As suggested by the standard, it is crucial to include timing aspects in the requirements in order to fulfill them. This is called the Fault Tolerant Time Interval (FTTI), which is further split into Fault Detection Time (FDT) and Fault Reaction Time (FRT). As stated by the interviewee, Fault Detection Time is the time within which a fault is to be detected and confirmed.
Fault Reaction Time is the time within which the safety system needs to react. The system needs to react within this time, for example by reaching a defined safe state. An example was given in which a large obstacle suddenly appears in front of the vehicle; the system then needs to trigger the brake within, say, 20 ms, especially when ML is involved. If this is not achieved, it might result in a crash. Hence, any safety-critical task needs to be completed within its allotted time. Otherwise, this can lead to potential hazards. It is therefore important in runtime monitoring to fulfill the timings stated in the requirements.

"...that’s sort of, you know. How we could achieve the fault tolerant time intervals... and this is runtime monitoring" - Interviewee 10

One of the research specialists we interviewed stated that the timing aspect applies both to hardware components and to the software that executes on them. Hardware components, which could be sensors, processors, microcontrollers, etc., all need to function within their defined parameter limits, be it clock speed, processor speed, frequency of execution, or voltage levels. Every component has a datasheet that it must fulfill and never violate, especially safety-critical components. It was added that every piece of software should be executed within its scheduled time. This includes accessing data in the right memory location at the right position. Everything from latency to accuracy must be considered when implementing runtime monitors.

"We have added runtime monitoring for our software so we make sure that every software module runs into the amount of time that is..." - Interviewee 2

Another participant we interviewed stressed the importance of timing for safety-critical systems in taking fast and quick decisions.
The examples mentioned were changing from one lane to another and moving from point A to point B under different road conditions. In these situations, a vehicle with AD capabilities cannot rely on the driver for all actions. Hence, the system is to be designed to take control and the necessary actions when needed. Some situations could be life critical while others may not be. Nevertheless, the underlying system should function as intended and take decisions as quickly as possible.

"As a AI component to be able to reduce the speed, or even maybe do a different strategic decision... that’s also tied I think to runtime monitoring and assess your capabilities as a AD system because you can’t rely on the driver anymore..." - Interviewee 8

4.2.3 Keeping it lightweight

Another important aspect raised by one of the interviewees is to keep runtime monitoring as light as possible in terms of processes, resources, cost, and execution. Multiple interviewees mentioned that there is always a trade-off between safety and cost. As stressed by the interviewee, cost is one of the major considerations for all organizations implementing runtime monitoring in their systems. In terms of software footprint, the monitor cannot use too much memory, since using a large share of the available memory can be very expensive. If the monitoring system itself is that expensive, the money is better spent on a different hardware system altogether. Hence, it has to be lightweight.

Figure 4.4: Difficulty in keeping it lightweight - Cause and Effect analysis

"...it cannot be too expensive, the monitoring. OK, so it should be lightweight for sure." - Interviewee 7

It was also stated that runtime monitoring cannot take up too many resources. When several tasks are scheduled in the pipeline, there is a chance that one of them gets stuck in a loop or does not execute. This might then trigger the system to shut down which is safety critical.
It also affects the performance of the system to a great extent. One of the interviewees also mentioned that, for economic reasons, the monitor has to be much smaller and does not need to match the performance figures of a very high-performing CNN. This in turn decreases the true positive rate when the output is passed through the second opinion goal. It was further added that, from a safety point of view, passing through the second opinion is good, but it is not very good from a performance point of view. So, it is good to keep the system and its processing as light as possible. It was also added that runtime monitoring is a probabilistic system that can go wrong at times. A fishbone diagram on keeping the system lightweight is shown in Figure 4.4.

"...the pipeline order... for some reason it will be stuck in a loop or something, and then the operating system can try to enforce shut off. So that’s one point that it needs to be light, that’s a requirement on the monitoring system itself..." - Interviewee 7

4.2.4 No access to inner states of model

Two interviewees mentioned that the system does not provide any information on how the runtime monitoring works. In vehicles with ML and different levels of AD, it is hard to figure out how everything works in the system, especially when neural networks are involved. Simply put, one can only see what goes in and what comes out.

"...it’s a neural network in place. That neural network is something black box in itself." - Interviewee 6

If an error occurs for some reason, the root cause will not be known. The system does not indicate whether the error is a classification error, a planning error, or an execution error. It is simply a black box. It incorporates several hundreds of parameters, yet it does not reveal what it does.

"you can only measure its performance.
You cannot explain or reason about its behavior" - Interviewee 3

4.2.5 Finding conditions that can be checked at runtime

Keeping this in mind, several interviewees stated the conditions and parameters that are to be considered when implementing proper runtime monitoring. Three interviewees stressed that the sensors must be properly placed and positioned when they are installed. The sensors used to gather inputs need to be taken into careful consideration. If the source of your input is wrong, then what you get afterwards will not always be right. Hence, the corner cases for these sensors, be it dirt, positioning, or a blockage of some sort, are all to be taken into account as parameters during runtime monitoring. A fishbone analysis is shown in Figure 4.5.

"Maybe you get some kind of blockage in the sensor due to that you can find like the critical use cases and start to use machine learning there to see what can be done to make the system more intelligent.." - Interviewee 5

It was also added that identifying these weak points beforehand serves the purpose. Two interviewees mentioned that there is always some noise when using sensors in the field. Every sensor has boundary conditions that it needs to fulfill. It should never violate these boundaries and must operate well within them. It was also stated that sensors degrade over time and age with continuous usage. This calls for re-tuning the machine learning model with updated parameters so as to extend these boundaries a bit further, to match reality.

Figure 4.5: Challenges in finding conditions that can be checked at runtime - Cause and Effect analysis

"Every sensor has some sort of noise. So we have to make sure that the the system operates with this noise. It can’t be more and it can be less actually.
So we have to make sure that this is within the boundaries" - Interviewee 2

Another interviewee added that these sensors must operate within a well-defined threshold in the environment in which they are used. A researcher mentioned that the changing environment is the reason for runtime monitoring. When the environment changes over time, the model might need to be retrained to reflect the new environment. It was also stated that another reason to have runtime monitoring is to check the performance of the system, to see that the system behaves as it was trained to.

"In our scenarios, it’s kind of key that the whole process like from the transmission to the execution state below a certain threshold" - Interviewee 4

One of the research specialists stated that a certain statistical analysis has to be done on the sensors before deploying them in the field. This statistical analysis provides information on the different types of conditions for the sensors to be used in different road and weather situations, especially for radar. If this is analyzed beforehand, then when a person drives, say, in the middle of the desert, the model is not surprised, as this situation will already be known. If there are no objects to detect, the driver might otherwise assume that there is an error with the radar. The analysis can be performed with a combination of the different types of sensors used on the market, using them day and night and taking all of them into consideration.

"We expect that there is at least something there that we can see with the radar and we see, we do some sort of statistical analysis of we know exactly what this is, how this should look like in a statistical sense" - Interviewee 2

One interviewee had a different perspective, namely that runtime monitoring could introduce errors into the software since it is a probabilistic system. This means that the data we receive is not completely correct all the time. In addition, it was stated that runtime monitoring helps in finding the right failure models. Be it via simulation or vehicle testing, we can learn which detections are false positives and which are false negatives.

A participant we interviewed mentioned that there has to be a proper classification of objects during runtime monitoring. He gave an example: if the system identifies a person standing at a pedestrian crossing but later suddenly classifies the same person as a bicyclist, this can pose serious risks. This is important to consider beforehand as it is safety critical.

Another research specialist complemented this by adding that geographic locations also matter during classification and monitoring. If the system is trained to classify people in Europe and is then deployed in a different country where, for example, driving directions differ, the system might not react in the same way. It was also stated that all the markets where the vehicles are to be deployed are to be considered as part of the training data and runtime monitoring. Hence, verification in the different design domains is necessary to have the right data in place. This is also not covered by the ISO 21448 (SOTIF) standard. Hence, a need for new and more advanced safety standards arises.

"We are creating a model which detects the person which has only people from Europe and if you’re putting the data in some other part of the country, we haven’t trained enough on, obviously that is an issue.. basically in the SOTIF analysis, we haven’t done that, it’s not acceptable... " - Interviewee 2

Two interviewees mentioned that the amount of training data used is also important as a pre-condition for runtime monitoring. With training data, we have a huge distribution of possible events that can affect the system. Hence, it is important to take careful consideration of the amount of training data used.
Finding a relationship between the training data used and the actual data the system faces in real time was also stressed. Hence, it is clear that too much training data makes it difficult to monitor during runtime.

4.2.6 Trade off between safety and reliability

A researcher stressed that if the system is too safe, it might not be very reliable. It was stated that when a system is implemented by incorporating safety requirements and standards, the system tends to have a safe state. For example, in the case of an Automatic Emergency Braking system, the safe state is usually one that makes the system unavailable. This is determined based on critical faults in the system. The researcher further added that it is also important for a system to be reliable, for example by having the right performance and speed, or by driving continuously without causing unnecessary interruptions (such as "switch off" states of the system) for the driver. Hence, the participant argued that overly strict monitors will reduce the reliability of the system.

One of the principal engineers stated that this also affects the performance of the system. A system has both true positives and false positives. When the system is not able to achieve the desired goal, it has to loop through a second opinion goal. This decreases the true positive rate. However, safety is achieved and good coverage is ensured.

4.2.7 Impact of Safety standards

A research specialist stated that safety standards apply to both the machine learning model and the runtime monitors. Safety cannot be achieved by considering only one of them. Implementing safety standards for both of them combined is the only solution for a safe vehicle, as stated by the interviewee.

"..so the safety is now moved from the model to the monitor instead, and it shouldn’t be. It should be the combination of the two that makes up the safety."
- Interviewee 2

It was also added that geographic and demographic conditions are not sufficiently covered in training for runtime monitoring to work effectively. Even the SOTIF standard is not equipped to handle these data completely. It was clarified that these are normal day-to-day aspects that should have been considered during the initial system design itself. When this information is not included in the first place, runtime checks are chosen and safety standards are adopted for them. New and more advanced safety standards need to be implemented.

Another principal engineer that we interviewed mentioned that the true positive rate has an impact on the performance of the underlying system. When the system is unable to make a decision, such as identifying a vehicle far beyond its reach, it needs to loop through a second opinion goal, which decreases the true positive rate. This then decreases the performance of the system. However, from a safety point of view, there is good coverage.

"..the true positive rate is actually decreasing when you have to pass it through this second opinion goal. It’s good from a coverage and safety point of view, but it reduces the overall system performance. It’s a safer not so very. Performance oriented. Yeah, it limits the performance." - Interviewee 3

One of the interviewees stressed the difficulties with freedom from interference when adopting safety standards for system solutions. The normal function (QM)¹ and the safety function (ASIL)² in the system should have separate communication channels and memory protection mechanisms. If they run on the same software component or memory partition, this is safety critical. Hence, it is important to have freedom from interference during runtime monitoring. A cause and effect analysis is shown in Figure 4.6.

¹ "Quality Management": the level QM means that the risk associated with a hazardous event is not unreasonable and therefore does not require safety measures in accordance with ISO 26262.

² Automotive Safety Integrity Level (ASIL) is a risk classification scheme defined by the ISO 26262 - Functional Safety for Road Vehicles standard. The ASIL is established by performing a risk analysis of a potential hazard by looking at the Severity, Exposure, and Controllability of the vehicle operating scenario. https://en.wikipedia.org/wiki/Automotive_Safety_Integrity_Level

Figure 4.6: Unclear scope and impact of Safety Standards - Cause and Effect analysis

4.2.8 Defining metrics for runtime checks

One of the interviewees stated that they have not gotten far enough with runtime monitoring and are not using any metrics for it. However, the interviewee shared several thoughts on this. The interviewee expressed that there can be systems and checks to validate physical effects such as dirt on the sensor, blurriness of the cameras, image resolution, etc. Therefore, it is good to have metrics with which physical effects can easily be translated into measurable events.

In addition, a research specialist stated that there is a lack of degradation models for the hardware used. For example, a camera, as a sensor, can degrade over time, losing pixels and resolution. It would be good to have a metric that can measure this over time so that the hardware can be replaced before it gets damaged or becomes a potential hazard.

"I would think essentially about performance degradation in the environment of the system and try to prepare the system for that by simulating this and see whether it still works" - Interviewee 9

It was mentioned that one of the common metrics used is the background confidence check. If the computed confidence value exceeds or falls below a threshold, the data is treated as a new dataset. This is something that is to be investigated.
These are also called parallel predictions, which are basically a comparison between two models. Another interviewee challenged this point, stating that there is a lack of confidence measures. With the system design in hand, it is important to analyze the potential failures one by one in order to gain more confidence. One of the interviewees suggested having a method that can prove that the defined metrics are indeed good and worthy. Another interviewee wanted to have a metric that measures the reliability of the runtime m