Designing in-vehicle voice assistants
Creating safer, integrated driver experiences

Master’s thesis in Computer science and engineering

CONNIE (KHANH) NGUYEN AND WILLIAM FALKENGREN

Department of Computer Science and Engineering
CHALMERS UNIVERSITY OF TECHNOLOGY
UNIVERSITY OF GOTHENBURG
Gothenburg, Sweden 2019

© CONNIE (KHANH) NGUYEN AND WILLIAM FALKENGREN, 2019.

Supervisor: Fang Chen, Department of Computer Science and Engineering
Advisor: Jenny Wilkie, Volvo Cars
Examiner: Staffan Björk & Olof Torgersson, Department of Computer Science and Engineering

Master’s Thesis 2019
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Cover: In-vehicle infotainment system with prototype voice assistant interface.

Typeset in LaTeX
Gothenburg, Sweden 2019

Abstract

Voice assistants are increasing in popularity with the rise of devices like smart speakers and screens. As people grow accustomed to using these assistants, it is likely they would want to use the same voice assistant in their car. Many modern cars already support integration of voice assistants from both Apple and Google.
In this project, voice assistants integrated into the vehicle and their effects on safety in terms of increased diverted attention and cognitive load are examined. Current voice assistants are also reviewed. Apple Siri and Google Assistant, two commercial voice assistants, are evaluated under the conditions of manual driving, as well as with longitudinal and lateral assistive drive features. New, improved design solutions and guidelines were evaluated through two prototypes with different approaches to solving problems found in existing voice assistants. The results indicate several similarities and differences in the existing design guidelines for the different voice assistants. Users provide input and thoughts about the existing solutions. New design solutions for decreasing distraction and cognitive load are presented. These new solutions can support continued research and further improvement of voice assistants in cars.

Keywords: Voice Assistant, Voice Interaction, Driving, Safety, Attention, Cognitive Load, Design Guidelines.

Acknowledgements

We would like to thank all personnel of Volvo Cars who in any way participated in and helped with the project. We would especially like to thank Jenny Wilkie for her expertise and all her supervision, feedback and guidance throughout the project. We would like to thank all test participants who took part in our studies. We would like to thank Chalmers for providing us with material and equipment. Lastly, we would like to thank Fang Chen for her supervision throughout the project.
Connie (Khanh) Nguyen & William Falkengren, Gothenburg, June 2019

Abbreviations

NLP - Natural Language Processing
IVI - In-Vehicle Infotainment
HMI - Human-Machine Interface
VUI - Voice User Interface
ADS - Autonomous Driving System
NHTSA - National Highway Traffic Safety Administration
SAE - Society of Automotive Engineers
PA - Pilot Assist
VA - Voice Assistant
CSD - Center-Stack Display
DIM - Driver Information Module
AA - Android Auto
AC - Apple CarPlay

Contents

List of Figures
List of Tables
1 Introduction
  1.1 Purpose
  1.2 Aim
    1.2.1 Scope
    1.2.2 Stakeholders
    1.2.3 Ethical Concerns
  1.3 Research Questions
2 Background
  2.1 Voice Interaction
    2.1.1 Voice Assistants
  2.2 Designing with Voice
  2.3 Autonomous Cars
  2.4 Distracted Driving
  2.5 Guidelines for In-vehicle and Voice Interfaces
    2.5.1 NHTSA Interface Guidelines
    2.5.2 Android Auto
    2.5.3 Apple CarPlay
    2.5.4 Google Assistant
    2.5.5 Siri
  2.6 Related Research
3 Theory
  3.1 Research Approach
  3.2 Wickens’ Attention Model
  3.3 Intensive and Selective Attention
  3.4 Cognitive Load
  3.5 Eye Movement
  3.6 Elements of Voice Interfaces
  3.7 The Cooperative Principle
4 Methods
  4.1 Wicked Problems and Iterative Design
  4.2 Literature Reviews
  4.3 Summative and Formative Evaluation
  4.4 Field and Lab Testing
  4.5 A/B Testing
  4.6 Interviews
  4.7 Cognitive Workload Measuring
    4.7.1 NASA-Task Load Index
    4.7.2 The Driving Activity Load Index
    4.7.3 Subjective Workload Assessment Technique
    4.7.4 Rating Scale Mental Effort
  4.8 System Usability Scale
  4.9 Subjective Assessment of Speech System Interfaces
  4.10 Eye Tracking
  4.11 Affinity Diagramming
  4.12 Wizard of Oz
5 Process
  5.1 Pre-study and Preparation
  5.2 Project Planning
  5.3 Literature Review of Existing Guidelines
  5.4 Summative Evaluation
    5.4.1 On-road Test Setup
    5.4.2 Data Collection and Handling
    5.4.3 Data Analysis
  5.5 Prototype Development and Evaluation
    5.5.1 Ideation and Prototype Development
    5.5.2 Simulator Test Setup
    5.5.3 Data Collection and Handling
    5.5.4 Data Analysis
  5.6 New Guidelines
6 Result
  6.1 Literature Review of Existing Guidelines
    6.1.1 Designing Car Apps
    6.1.2 Voice and Manual Input
    6.1.3 General Voice Responses
    6.1.4 Situation Awareness
    6.1.5 Presenting Choice
    6.1.6 Error Handling
    6.1.7 Discoverability
    6.1.8 Display
    6.1.9 Notifications
  6.2 Summative Evaluation
    6.2.1 Qualitative Results
    6.2.2 Quantitative Results
  6.3 Prototype Development and Evaluation
    6.3.1 Prototypes Developed
    6.3.2 Qualitative Results
    6.3.3 Quantitative Results
  6.4 New Guidelines
    6.4.1 Voice and Manual Input
    6.4.2 General Voice Responses
    6.4.3 Error Handling
    6.4.4 Situation Awareness
    6.4.5 Presenting Choice
    6.4.6 Display
    6.4.7 Discoverability
    6.4.8 Notifications
7 Discussion
  7.1 Findings
  7.2 Process
    7.2.1 Methods
    7.2.2 Tests
  7.3 New Guidelines
  7.4 Future Work
8 Conclusion
Bibliography
A Project Plan GANTT Chart
B Summative Evaluation Survey
C Summative Evaluation Test Protocol
D Summative Evaluation Test Schedule
E Prototype Evaluation Test Protocol
F Prototype Evaluation Survey
G Interaction Paths of VA Prototypes
H Prototype Evaluation Schedule
I DALI Survey
J SUS Survey
K Summative Evaluation KJ Results
L Prototype Evaluation KJ Results
M Summarized Existing Guidelines

List of Figures

2.1 The Android Auto GUI
2.2 The CarPlay GUI
2.3 Google Assistant displaying results for nearby restaurants on an Android phone
3.1 Wickens’ Multiple Resource Model [48]
4.1 Design funnel as described by Bill Buxton where dashed lines indicate divergence and solid lines indicate convergence in the design process [7]
4.2 A full example of a SUS [6]
4.3 The SASSI questionnaire [24]
4.4 Affinity diagram (partial) used to analyze qualitative data
5.1 Interior of the test car model, equipped with Android Auto and Apple CarPlay
5.2 Eye tracking software and video with three camera views
5.3 Affinity diagram (partial) of qualitative summative evaluation data
5.4 Prototype development in Adobe XD
5.5 Simulator test setup with a dividing wall between the test participant and wizard (not to scale)
5.6 Video used for eye tracking
5.7 Affinity diagram (partial) of qualitative prototype evaluation data
6.1 Frequency of off-road glances
6.2 Intervals of DIM and IVI glances during four conditions
6.3 The count of DIM and IVI glances during the four conditions
6.4 Count of off-road glances by direction and condition
6.5 Count of off-road glances during the various test tasks
6.6 Count of off-road glances during tasks with error indications
6.7 DALI weighted rating of Android Auto and Apple CarPlay
6.8 Adjusted ratings of the individual DALI dimensions
6.9 Weighted ratings of manual drive and pilot assist
6.10 Prototype 1, left, and Prototype 2, right, and their differences when sending a text message
6.11 Prototype 1 and Prototype 2 with their differences in voice interaction for showing results in a list
6.12 Frequency of off-road glance duration times
6.13 Count of off-road glances by task
6.14 Count of off-road glances during tasks with error indications
6.15 Count of off-road glances by task and prototype
6.16 Weighted DALI ratings for Prototype 1 and Prototype 2
6.17 Adjusted rating of the dimensions of the DALI
6.18 SUS score comparison with adjective ratings and acceptability ranges [5]

List of Tables

2.1 SAE levels of driving automation [41]
4.1 The NASA-TLX measurement factors and their descriptions [17]
4.2 The DALI measurement factors and their descriptions [35]
6.1 Prototype similarities and differences
6.2 SUS Results

1 Introduction

Voice assistants have exploded in popularity in recent years thanks to smart speakers. Natural Language Processing (NLP) allows these smart speakers to communicate with their users in a convenient and natural way and makes them suitable for helping their users with a large and varied set of tasks. One report predicts that 47% of American homes will have a smart speaker by 2022 [34].
As people grow accustomed to the voice assistants in their homes and on their phones, it is not unreasonable to assume that drivers will use the same voice assistant from their daily lives in their cars. This possibility is in fact a reality in many modern cars that offer voice assistant integration directly into the in-vehicle infotainment system (IVI) through Android Auto or Apple CarPlay.

Voice assistants, and voice interaction at large, offer drivers an eyes-free, hands-free way to complete secondary tasks while driving. However, voice assistants are susceptible to recognition issues and have transient, paced interaction flows that require immediate response from the driver. Despite the integration of voice assistants into vehicles, set guidelines for safe voice interaction are not well defined. As voice assistant integrated IVIs become increasingly prevalent, it is necessary to evaluate the existing commercially available voice assistant integrated IVIs in relation to distracted driving.

1.1 Purpose

Voice interaction in vehicles has long been a topic of research [25]. However, many of the previous studies on voice interaction in vehicles focus on evaluating voice interactions developed by the OEM, such as the Chevrolet MyLink and the Volvo Sensus [28, 39]. Generally, voice interaction has been found to be a safer alternative to standard HMI inputs to the IVI [25]. However, the landscape of voice interaction in vehicles is expanding with voice assistants developed by software giants including Google and Apple. These voice assistants have seldom been examined in relation to distracted driving and have yet to be studied in a setup where they are fully integrated in an IVI. A deeper understanding of voice assistants’ effect on distracted driving is critical as voice assistants increasingly integrate into available IVIs.
Despite the heralded safety benefits of voice interaction, standards for safe voice interaction in vehicles are largely undefined. The National Highway Traffic Safety Administration (NHTSA), which has set guidelines and standards for safe manual interactions with IVIs, has yet to publish similar guidelines for voice interaction [30, 31]. With many voice interaction systems already commercially available, designers cannot continue to put off considerations for safe voice interactions while driving. Voice assistants in particular introduce the possibility for third-party designers and developers to create and distribute in-vehicle applications, such as navigation apps. Today’s guidelines treat the design of the IVI and voice assistant as two separate entities rather than one integrated voice-driven, multimodal experience [2, 13, 14].

An added layer of consideration is the development of autonomous vehicles in parallel to voice assistants. Society of Automotive Engineers (SAE) Level 2 autonomous driving offers drivers support in the primary task of driving through features such as adaptive cruise control and lane keeping [41]. However, a misunderstanding of how these systems work and their limitations may lead drivers to overly trust or rely on these support systems, thereby diverting their attention from the primary task to a secondary one. The effect of using voice assistants to complete secondary tasks in combination with driver support systems should also be considered for safer interaction.

1.2 Aim

The primary aim of this project is to improve voice assistant interactions for completing secondary tasks without compromising driver safety. Thus, it is necessary to assess the current state of the art of voice assistant integrated IVIs commercially available today and the design patterns they employ. These systems will be assessed in relation to their effect on distracted driving, which includes visual distraction and cognitive load.
While voice interaction can reduce visual distraction, it has some possible drawbacks. It is transient, meaning it is non-persistent, and it can potentially increase the cognitive load of completing secondary tasks, such as mentally visualizing navigation instructions in an unfamiliar area. Both visual distraction and cognitive load must be carefully balanced to create what may be considered a safe interaction while driving. This project also aims to produce design guidelines for multimodal voice assistant-driven interactions for performing secondary tasks.

1.2.1 Scope

This project is limited to the performance of secondary tasks using a voice assistant in a vehicle equipped with a voice assistant integrated IVI. Guidelines for currently existing integrated voice assistants will be evaluated and a new set of guidelines will be suggested. The new suggested guidelines will consist of currently existing guidelines as well as new guidelines developed in this project.

The project is further delimited to situations with a driver using a Level 2 or lower autonomous passenger vehicle with no additional passengers. The drivers are defined as civilian drivers: people who may drive as part of their daily commute, but not those who drive for extended periods of time as part of their profession, such as a taxi cab driver or a cargo trucker. As such, the design guidelines produced as a result of this project may be limited to driving scenarios that match the scope of the project.

As this project considers the current state of voice assistants in vehicles, the produced guidelines would be directly applicable for the near future. As autonomous driving improves, reaching high or full levels of automation, the types of tasks users will perform in the vehicle will likely shift to be more entertainment focused. However, even with the advent of fully autonomous vehicles, complete market adoption of such vehicles will not happen overnight.
Thus, the guidelines produced by this project will remain relevant for the remaining vehicles on the road that are Level 2 and under.

1.2.2 Stakeholders

This project is carried out as part of a larger project known as SEER (Seamless, Efficient and Enjoyable user-vehicle inteRaction). SEER is a joint collaboration between Volvo Cars, Volvo Technology, RISE Viktoria, and Semcon; the project is funded by Vinnova [44]. The SEER project is focused on improving the experience of completing secondary tasks in low-level autonomous vehicles (up to SAE Level 2). General findings and projects developed under the umbrella of SEER are available to the public to promote knowledge sharing and innovation in the automotive industry.

1.2.3 Ethical Concerns

This project will use on-road tests to assess the current state of voice assistant integrated IVIs. In such testing conditions, test participant safety is paramount and shall take precedence over the test itself. Additional measures, such as using a specially equipped vehicle for facilitator interjection, may be necessary in the interest of safety.

An additional concern is the handling of personal user data. The commercially available voice assistants send and retrieve data to and from external servers owned by parties outside of this project, such as the voice assistant author company and third-party app services. Also concerning personal user data is the collection of data as video footage. Video footage for this project was collected with full consent from the test participants, who also had the option to have any personally identifying footage removed once the collected data was analyzed.

1.3 Research Questions

In the context of SAE Level 2 (and lower) vehicles, this project addresses the following questions:

1. What adjustments to existing NLP-based voice assistant design guidelines should be made for safer interaction while driving?
   (a) What existing design guidelines and patterns are implemented in voice assistant integrated infotainment systems?
   (b) What improvements to existing voice assistants can be made to minimize diverted attention from the primary task of driving?
   (c) What improvements to existing voice assistants can be made to minimize cognitive load while executing a secondary task during the primary task of driving?

In the primary research question, safer interactions are defined with respect to how well they comply with NHTSA design guidelines for human-machine interfaces (HMIs) and reduce distracted driving [30, 31]. While the NHTSA guidelines explicitly do not consider voice interaction, they may serve as a starting point as the infotainment systems examined are multimodal. Moreover, NHTSA has yet to define safe interactions with respect to voice interaction, though it has plans to do so in the future. The current NHTSA guidelines related to this project are covered in sections 2.4 and 2.5.1.

2 Background

This project delves into several areas including voice interaction, autonomous cars, and distracted driving. This chapter provides a brief history of each area and of previous scientific work researching the intersection of all three.

2.1 Voice Interaction

The field of voice interaction has experienced a recent increase in interest thanks to the introduction of smart speakers to the market. However, voice interaction long predates smart speakers; early interactive voice systems were first introduced in the 1990s [22]. These early systems were known as finite state voice user interfaces (VUIs). VUIs are typically categorized as either finite state or natural language processing (NLP), but hybrids do exist [22]. Finite state VUIs are characterized by a limited set of commands for each point in the interaction flow, typically in a tree menu [22]. Most people encounter finite state VUIs on the phone, in the form of automated customer service systems.
These systems are usually met with frustration as many users have difficulty finding the information or action they want in a tree menu.

Natural language processing VUIs improve upon their finite state predecessors by recognizing a wider array of user input for the same action through statistical language modeling. Prime examples of NLP VUIs are the voice assistants available on smartphones and smart speakers. Voice assistants typically process voice input off-site through cloud computing. The most popular voice assistants include Amazon Alexa, Google Assistant, Apple Siri, and Microsoft Cortana. These voice assistants allow users to interact with them in a more natural, conversational manner. Moreover, thanks to their off-site processing, voice assistants can learn and improve over time as more users interact with them [22].

2.1.1 Voice Assistants

Voice assistants take NLP VUIs to the next level. Not only can they understand and respond to conversational input, they can also use information about the user to provide relevant responses. For example, users can ask a voice assistant, "What are upcoming Robyn concerts?" and the assistant can respond with upcoming concert dates in the user’s city with the option to hear about concerts in other cities. However, not all tasks completed through a voice assistant take advantage of this contextual information, and some may require users to repeat information, introducing frustration into the input process.

Figure 2.1: The Android Auto GUI

Figure 2.2: The CarPlay GUI

Voice assistants have recently made the leap from the phone to the car through IVI integration interfaces. Integration interfaces allow drivers to connect their smartphone to their car’s IVI, enabling drivers to access some of the functionality of their phones directly on the IVI, including voice assistants. Two companies that offer IVI integration interfaces are Google and Apple.
Google’s Android Auto allows drivers to connect their Android phone and Google Assistant to the IVI. The Android Auto home screen can be seen in Figure 2.1. Similarly, Apple CarPlay allows iOS users to integrate their iPhone and Siri assistant into the IVI. The Apple CarPlay home screen can be seen in Figure 2.2.

Both Android Auto and Apple CarPlay enable drivers to use select apps from their phone on the IVI. Only apps which belong to an enabled category and have been developed for in-vehicle use may be available on the IVI. Integration interface authors, such as Apple and Google, dictate which categories may be enabled. Categories enabled on both Android Auto and Apple CarPlay are communication, navigation, audio, and automaker [2, 13]. The communication category includes apps with messaging and VoIP calling features. Navigation apps allow drivers to locate points of interest and provide driving directions. Audio apps cover an array of audio services which include music streaming services, podcast stations, and sports news. Automaker apps allow drivers to get information about their car and adjust car settings through the integration interface.

If a driver’s voice assistant is enabled on the phone, then enabled apps may be used via the voice assistant. However, the degree of voice interaction is left to the discretion of the app developer. If a developer has chosen not to include voice interaction, then some features of the app may not be controllable through the voice assistant, instead requiring manual interaction.

2.2 Designing with Voice

While there are pure VUIs, which allow only voice input and output, many interfaces provide multiple modes for both input and output. Screens, keyboards, and other types of input are combined with voice to produce multimodal interfaces. There are several approaches to using voice in interaction, which can be categorized as screen-first, voice-only, and voice-first [10, 47].
The screen-first approach prioritizes the screen and uses voice to enhance screen functionality [47]. The screen-first approach is currently applied to most smartphones, as their voice assistants are highly dependent on the screen. In many cases, the user is unable to complete a voice-initiated interaction without manual input through the screen [47]. For example, if a user requested nearby restaurant recommendations, a screen-first system may read aloud the first recommendation and output the remaining alternatives on the screen for the user to manually select an option. Screen results from asking Google Assistant for nearby restaurants can be seen in Figure 2.3.

Figure 2.3: Google Assistant displaying results for nearby restaurants on an Android phone

A voice-only interaction uses only voice for both input and output, unlike screen-first and voice-first. Early screenless smart speaker models such as the Amazon Echo, Google Home, and Apple HomePod are examples of voice-only design [47]. Due to the singular mode of input and output, using voice-only interactions to complete simple tasks can become tedious [47].

A voice-first approach is the inverse of the screen-first approach. In a voice-first design, a complementary display is used to visually supplement the voice interaction and a user can complete an interaction through voice alone [47]. Voice-first has been widely embraced in the latest voice assistant devices, such as the Amazon Echo Show and the Google Home Hub, which include touchscreens. The voice-first approach is different in that many traditional graphical user interface elements, such as heavily-nested menus and visually dense content, are completely eliminated in favor of contextualizing information to enhance whatever the voice is communicating [47].
Moreover, a voice-first approach assumes that the user may not always be able to look at or touch the screen; therefore, voice carries the bulk of the interaction in a voice-first approach [47].

2.3 Autonomous Cars

As the field of voice interaction continues to develop, so does the field of autonomous driving. According to SAE International, there are six levels which describe the level of autonomous driving a car is capable of, as shown in Table 2.1. It is worth noting that Levels 2 and under still require a human driver to perform part or all of the driving task, even with the autonomous driving system (ADS) engaged [41]. In contrast, vehicles classified as Level 3 and up are able to fully take over the primary task of driving, under varying scenarios [41].

Table 2.1: SAE levels of driving automation [41]

Level | Name | Autonomous Driving System Role

Human driver monitors driving environment
Level 0 | No Driving Automation | Does not perform any of the driving task on a sustained basis
Level 1 | Driver Assistance | Performs part of the driving either in the longitudinal OR lateral motion and can be disengaged immediately upon driver request
Level 2 | Partial Driving Automation | Performs part of the driving in both the longitudinal AND lateral motion and can be disengaged immediately upon driver request

Autonomous Driving System monitors driving environment while engaged
Level 3 | Conditional Driving Automation | Performs all of the driving under select driver-manageable conditions and can be disengaged immediately by the driver or issue a request for the driver to intervene
Level 4 | High Driving Automation | Performs all of the driving under most driver-manageable conditions and may delay driver-requested disengagement
Level 5 | Full Driving Automation | Performs all of the driving under all driver-manageable conditions and may delay driver-requested disengagement

For Level 2 and under autonomous driving, the ADS can provide drivers assistance with the
primary task of driving. This can in turn free up some of the driver’s attention and cognitive capacity to complete secondary tasks, such as tuning the radio, replying to a text message, or getting directions to a nearby point of interest. However, in Level 3 and up autonomous driving, handing off the primary task of driving from the human driver to the ADS may introduce a new interaction paradigm in the vehicle. In scenarios where the human driver is no longer responsible for the driving, the primary task may shift dramatically from driving to other tasks, such as entertainment or work.

2.4 Distracted Driving

The National Highway Traffic Safety Administration (NHTSA) is the U.S. governmental agency responsible for setting and enforcing safety standards in vehicles [30]. In 2016, the NHTSA reported that 3,450 deaths in the United States were due to distracted driving [29]. The year prior, a staggering 391,000 people suffered injuries from incidents related to distracted driving [29]. With statistics like these, distracted driving has become a key traffic safety issue.

According to the NHTSA, distracted driving refers to the diversion of a driver’s attention from the primary task of driving to other activities or secondary tasks [30]. Electronic devices in particular are an area of concern for the NHTSA as more and more technology is incorporated into modern vehicles. Electronic devices can influence drivers by causing visual distraction, manual distraction, and cognitive distraction [30].

In an effort to combat distracted driving from electronic devices, the NHTSA has thus far issued two phases of guidelines for designing in-vehicle electronic devices. Phase One of the design guidelines concerns the design of original equipment (OE), such as the in-vehicle infotainment system that already comes installed on a vehicle [30].
Phase Two extends the guidelines from the first phase to include portable and aftermarket devices, including smartphones with a car mode [31]. Both guidelines use eye glance metrics as acceptance criteria, where eye glances away from the road for more than 2.0 seconds are correlated with an increased crash risk [30, 31]. While both Phase One and Phase Two acknowledge voice interaction as an alternative to traditional HMIs, both explicitly exclude voice interaction from their scope. The NHTSA has announced plans for Phase Three of the guidelines, which would provide recommendations specifically for voice interaction; however, there is currently no set date for when these guidelines will be published, leaving safe voice interaction in vehicles largely undefined.

While the jurisdiction of the NHTSA is limited to the United States of America, its safety recommendations extend beyond those borders. In following the guidelines set by the NHTSA for vehicles in the American market, car manufacturers in practice also apply these guidelines to vehicles in markets outside of the United States. Alternate guidelines for designing in-vehicle interfaces include those published by the Japan Automobile Manufacturers Association (JAMA), the Alliance of Automobile Manufacturers (AAM), and the EU [20, 26, 9]. However, the NHTSA guidelines are the most recent and likely the most relevant when considering voice interaction as an emerging technology.

2.5 Guidelines for In-vehicle and Voice Interfaces

At present, few guidelines consider the holistic interface of a voice-assistant integrated IVI. However, the existing guidelines for both in-vehicle and voice interfaces outline important considerations for each respective interface that should be taken into account.

2.5.1 NHTSA Interface Guidelines

Phase One of the NHTSA interface guidelines is applicable to original IVIs [30].
Recommendations in the Phase One guidelines include where to place the IVI, what tasks should not be allowed on the IVI, and IVI response time. The guidelines also describe a number of best practices for interacting with an IVI manually. Some notable interaction guidelines include single-handed operation, interruptibility, and disablement [30]. Drivers should be able to operate the IVI with a single hand while driving, and the IVI should not require the driver to complete an uninterrupted sequence of tasks [30]. Drivers should be able to stop a task midway and then resume it if it was not completed [30]. Additionally, IVIs should have the ability to disable the display of any non-safety related information through methods including dimming, blanking, or changing the state of the display [30].

Phase Two of the NHTSA guidelines expands upon Phase One to include the interfaces of portable and aftermarket devices [31]. Notable additions from the Phase Two guidelines include pairing devices, driver mode, and access to emergency services and alerts [31]. For devices that can be paired with the original IVI, the pairing and disconnection should be easy to complete. When paired and using the IVI display, guidelines from Phase One should also be followed [31]. For unpaired devices, there must be a driver mode which conforms to the Phase One recommendations [31]. As the second set of guidelines is an expansion, portable and aftermarket devices described in Phase Two must also follow the guidelines defined in Phase One. In both scenarios, emergency services and alerts must be easily accessible [31]. However, the guidelines do not state what additional notifications, such as communication notifications, should also be accessible.
2.5.2 Android Auto

Android Auto is the integration interface made available by Google for compatible Android phones. Only apps which fall into the navigation, communication, media, or automaker categories can be enabled for use through Android Auto [13]. The Android Auto design guidelines are primarily concerned with the appearance and structure of visual content on the IVI. Android Auto uses a global UI, which means the visual interface of each app uses a template provided by Google [13]. With this template approach, drivers using Android Auto do not need to learn app-specific UIs when switching between two apps in the same category. The Android Auto guidelines make almost no mention of designing for voice interaction, save for constructing or replying to a message [13].

The Android Auto guidelines prescribe recommendations for user input, menu organization, and notification display. The pace of input into the IVI should be determined by the user [13]. This recommendation aligns with the NHTSA guidelines for interruptibility. The Android Auto guidelines also suggest that items in the drawer menu be context specific [13]. For example, rather than displaying broad categories such as "All Songs" and "All Artists", the menu items should be more specific, such as "Top Hits" or "Favorite Artists". The guidelines also state that notifications may be used if they are appropriate to driving or important enough to interrupt the driver [13]. However, Android Auto provides little guidance on what is considered "important enough" and leaves it up to the discretion of the designer.

2.5.3 Apple CarPlay

Like those for Android Auto, the Apple CarPlay guidelines use a global set of UI elements and a template system [2]. Voice integration is briefly described for automaker and communication apps, though Apple does have a separate guideline for custom Siri voice commands [2].
When CarPlay is active, interactions on the iPhone should be eliminated and CarPlay interactions should never require input from the iPhone [2]. The Apple guidelines also provide a number of test conditions for designing a CarPlay-enabled app [2]. For example, apps should be tested in an actual car, not a simulator alone, and in varying network conditions [2].

Generally, the Apple CarPlay guidelines provide more guidance to designers regarding the architecture of apps, including badging, error handling, and navigation structure. The Apple CarPlay guidelines also provide detailed recommendations for content writing, organization, and notifications. Written content in CarPlay should be succinct and avoid accusatory or judgmental tones [2]. Content and navigation should require as few inputs as possible, either through flat or hierarchical navigation [2]. Moreover, there should only be one path for manual input to a specific view [2]. Alerts should be minimized and used only when there is an error so that users will take them seriously [2].

2.5.4 Google Assistant

Google’s design framework for voice interaction is called Conversation Design. It is an extensive framework with a lot of detailed information and examples. Google highlights the framework as being multimodal and drawing on many different disciplines of design, such as voice, audio and visual design. Google argues that all of these disciplines are required to design real conversations as, according to them, real conversation is a multimodal activity. The Conversation Design framework is built upon Grice’s Cooperative Principle. This principle states that conversation is shaped by the social context and that this shaping of the conversation relies on a type of subconscious cooperation between the conversing parties. Grice’s Cooperative Principle is covered in depth in Section 3.7 of this report.
The design framework provides extensive guidelines regarding the context of conversation, variations of phrases and turn-taking during dialogues. A shorter list of visual components to be used together with voice assistants is also provided. Information regarding how and when graphical components are to be used in combination with conversation is, however, very limited, and the few related guidelines that exist are very general.

2.5.5 Siri

The Siri voice guidelines describe how to integrate the voice assistant in a variety of contexts for a seamless voice-driven experience [3]. Moreover, the guidelines describe when Siri would enhance an interaction and how to create Siri responses. The Siri framework supports shortcuts which can perform useful or frequent actions without much navigation [3]. Shortcuts should be short and concise, but not context-specific [3]. An example shortcut could be "Order clam chowder". Designers can make shortcuts more relevant and accurate using custom vocabulary or by providing examples on the screen [3]. Like Google Assistant, Apple recommends that Siri responses be conversational. Apple additionally recommends that actions should be voice-driven with as little manual input as possible, which is a voice-first approach. Verbal responses from Siri should be accurate and relevant to the user’s request [3].

2.6 Related Research

Voice interaction in vehicles has been well researched in terms of distracted driving and usability. However, as voice interfaces continue to evolve, so do research opportunities in the field. Previous research on automotive VUIs has generally focused on in-vehicle VUIs, that is, VUIs that are built into the car by the OEM rather than portable alternatives such as modern-day voice assistants. In their 2013 review, Lo and Green surveyed key in-vehicle VUIs that had been researched [25].
The VUIs covered by Lo and Green all used NLP, but they did not utilize cloud computing as voice assistants do [25]. Core functionality of the systems surveyed included communication, media, and navigation, not unlike the enabled app categories on both Android Auto and Apple CarPlay [25]. However, some of these systems had extended functionality, such as climate control via voice command [25].

More recent studies compared different VUIs against each other to identify the effect of different voice-driven multimodal interactions on distracted driving. Mehler et al. compared the Chevrolet MyLink and Volvo Sensus, where the former allows for ‘one-shot’ voice input while the latter requires input through a series of menus and sub-menus [28]. For most tasks, ‘one-shot’ input performed better than guided, menu-based input when there were no recognition errors [28]. However, if there were recognition errors, the ‘one-shot’ input, similar to that of current voice assistants, increased driver workload and caused user frustration [28]. Reimer et al. further expanded upon this work by comparing a Samsung S-Voice assistant against the two in-vehicle systems evaluated by Mehler et al. [28, 39]. Reimer et al. found that the smartphone assistant actually performed worse than the embedded in-vehicle systems [39]. However, they proposed that coupling the smartphone to the embedded IVI to create one holistic experience may reduce workload and visual demand [39].

One study that does examine the effect of the holistic experience of a voice-assistant integrated IVI on distracted driving was conducted by Strayer et al. for the AAA Foundation for Traffic Safety [42]. Motivated by the lack of Phase Three guidelines from the NHTSA, this study investigated how Apple’s Siri affects distracted driving [42].
The study found that the use of a voice assistant to carry out a secondary task significantly increased the crash risk; however, the study has yet to be corroborated and does not provide suggestions to address the issue of increased risk [42].

Beyond voice interaction, distracted driving has been studied in many capacities. A 2009 review by Bach et al. surveyed 100 papers related to understanding attention within automobiles [4]. Despite the extensive study of attention and cognitive load while performing secondary tasks, the review makes it clear that there is no single method for assessing attention and cognitive load [4]. Previous studies have used primary task performance, secondary task performance, eye glance behavior, physiological measures, and subjective assessments to measure attention and cognitive load [4]. The variety of methods and the lack of a single standard illustrate the difficulty in capturing and measuring what goes on in the mind while performing multiple tasks.

3 Theory

Voice interaction, especially for in-vehicle use, sits at the intersection of many fields including design research, attention, cognitive load, and linguistics. This chapter covers the theory and domain-specific knowledge from these fields that are related to this project.

3.1 Research Approach

The research approach of this project relies on human-centered design (HCD), where user involvement and testing with users are central to the design and development of a product. The idea that the solutions to a problem are held within the very people who face this problem is a core idea of HCD [19]. Social research principles also support the frequent user involvement in this research project. One prevalent idea in the social research approach is that if enough people agree on a subjective opinion, it can become an objective fact [46].
This can be said of the design field, where many designers and researchers consider involving users as part of the design process or design research to be standard, thus objectively validating HCD as an approach.

HCD largely focuses on understanding the users and evaluating with and for users throughout the process [12]. Another characteristic of an HCD approach is applying a wide range of disciplinary skills and perspectives [12]. This project especially applies theory from psychology and cognition to be able to properly research the user’s attention and cognitive load. Applying a varied set of theories and concepts from different fields is, according to Gaver, a way to both inspire and articulate new and already existing designs [11].

3.2 Wickens’ Attention Model

The attention of a human being is a limited resource. When it comes to the task of driving and all of the secondary tasks that follow in a modern car, managing attention and distributing it correctly becomes very important. There are various theories explaining the complexities of human attention resources. One which has proven to be especially relevant to mental workload in relation to multitasking is Wickens’ Multiple Resource Theory [48]. According to this theory, the attention of humans can be divided into different resource pools. The different resource pools represent a human’s ability to process different types of stimuli. The internal processes are divided into perception, cognition and response. Figure 3.1 shows the four-dimensional structure of the resource model.

Figure 3.1: Wickens’ Multiple Resource Model [48]

According to Wickens, humans are able to perceive four different types of input: auditory-spatial, auditory-verbal, visual-spatial and visual-verbal [48]. Multiple Resource Theory posits that multiple simultaneous inputs are better perceived if they are of different types.
When internal mental processes move from the perception of input to the cognition of it, humans are capable of simultaneously processing verbal and spatial input. In the final internal process, humans are capable of deciding a response to manual-spatial and vocal-verbal input at the same time. However, the ability to simultaneously process input is still affected by the weight and complexity of the individual inputs. This means that very complex spatial input will affect a person’s ability to process other input at the same time, even if the additional input is of another modality.

Multiple Resource Theory helps to reinforce the findings of previous research which concludes that voice interfaces are a safer input method in vehicles [28, 39, 25]. According to the theory, verbal information from a voice interface should interfere little with the visual information from looking at the road, as the two inputs draw on different resource pools when processed in the driver’s mind.

3.3 Intensive and Selective Attention

Another theory for explaining human attention is Kahneman’s work on effort and attention [21]. According to Kahneman, the two most important factors affecting attention are intensity and selectivity [21]. Intensity is directly connected with the effort one applies to their current focus of attention [21]. A person may direct greater effort into a specific focus of attention when motivated by arousal or personal choice [21]. Selectivity describes how a person decides to distribute their effort toward different sources of attention [21]. Ultimately, the total amount of effort available at a given moment is limited [21].

Problems occur when different sources of attention and their demands for effort interfere with each other. This explains the difficulty behind dividing attention, such as in multitasking.
The idea of interference in the distribution of attention is interesting, as it provides a contextual explanation of the ideas in Wickens’ Attention Model, summarized in Section 3.2, regarding the difficulty of dividing attention.

3.4 Cognitive Load

There are many different definitions of cognitive load, sometimes also referred to as cognitive workload. Waard decomposes cognitive workload into two parts: demand and load [45]. Demand is the external demand a task places upon a user. Load is the individual effect of the task demand placed upon a user. Task demand is highly dependent on the complexity of the task: increased task complexity increases the demand of the task. Perceived load is more complex and depends on a variety of factors including the skill, experience, and current mood of the person performing the task.

When examining cognitive load, both task demand and perceived load should be considered, as the two are closely related. In a driving situation, the main task of driving places a certain demand on the driver. Depending on the driver’s skill level and experience, the perceived load will vary. When adding secondary tasks, like making phone calls and playing music, the total load on the driver further increases.

When analyzing cognitive load, there are several aspects to consider. Cognitive load is essentially a measure of how many mental processing resources are available. The upper limit of resources is referred to as the capacity [45]. In a practical scenario where cognitive load is measured, a researcher tries to measure how many resources are available and how close the test participant is to their capacity limit. In a driving scenario, the driver always needs to have enough resources to handle the primary task of driving.

In Section 3.2 the concept of attention resources was introduced. Attention resources are closely related to mental processing resources.
Perceiving input, the first step of the previously mentioned Wickens’ Attention Model [48], is a prerequisite to processing input through the consumption of mental resources. The stages of cognition and response in Wickens’ model correspond with the mental processing concepts that are central to discussing cognitive load.

3.5 Eye Movement

In order to assess visual distraction, it is important to understand the four basic types of eye movement by which a person’s gaze can be analyzed. These are saccades, smooth pursuit movements, vergence movements, and vestibulo-ocular movements [37].

Saccades are the most basic type of movement. Saccades are quick movements that occur when a person changes their eyes’ fixation from one point to another [37]. Saccades may be short or long depending on the situation. When driving, the moment between saccades can be interesting to analyze as the user’s fixation points are likely to switch between the road and the various interfaces within the vehicle.

Smooth pursuit movement occurs when a person fixates their view on a moving object. Smooth pursuit movement is difficult to perform without a moving object. Attempts by the untrained to perform this eye movement may actually result in a series of short saccades instead [37].

Vergence movements occur when a person fixates on a point that moves either closer to or further away from the person [37]. Vergence movements differ from the two types mentioned above: during vergence the eyes move in different directions from each other, whereas during saccades and smooth pursuit movement they move in the same direction [37].

Vestibulo-ocular movements are made in order to stabilize the eyes against movements relative to the outside world, such as fixating on a point while the head is moving in some direction [37].

When working with eye tracking, several types of data can be analyzed. One type of data is glances.
A glance is a fixation on a specific point in the world between two saccades. By this definition, glances have both a duration and a direction. With respect to this project, glances are a highly relevant type of data as they are used in part by the NHTSA to define safe task interactions [30]. Glance directions can be divided into glance areas of interest in order to more easily measure glances on specific areas of interest within the car.

3.6 Elements of Voice Interfaces

To understand and discuss VUIs, it is important to know the basic elements of a voice interface. These elements are: utterances, responses, prompts, and intents. Together, these elements create a dialog, a linguistic exchange between the user and the VUI [16].

An utterance is a natural unit of speech which can range from a single word to a small cluster of sentences [16]. With respect to VAs, utterances are usually inputs from the user.

A response is the second utterance in a summons/response pair [16]. If a summons is a request from a VA user, such as "What’s the weather today?", then a response manifests as information related to the day’s weather.

A prompt is a system utterance that helps guide user input [16]. Prompts are most often in the form of questions which can be explicit ("Which flowers would you like to order, roses or daisies?"), implicit ("Which type of music would you like to listen to?"), or open-ended ("What can I do for you?") [16]. Inferential prompts are typically statements that convey to the user the capabilities of the VUI ("I can answer questions about train arrivals, departures, and on-board amenities.") [16].

An intent is a representation of an action or a feature that fulfills a user’s spoken request. Intents may include variable information to complete a user’s request. In the previous example for responses, the intent is to get weather information, where "today" is a variable that enables the VUI to respond with relevant information.
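The relationship between utterances, intents, variables, and prompts described above can be illustrated with a small sketch. This is not how any commercial voice assistant processes speech; it is a minimal rule-based illustration, and the names `Intent`, `parse_utterance`, `respond`, the `get_weather` intent, and the follow-up prompt wording are all hypothetical.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Intent:
    """An action that fulfills a spoken request, plus its variables."""
    name: str
    variables: dict = field(default_factory=dict)


# Hypothetical pattern: one intent with an optional "date" variable.
PATTERNS = [
    ("get_weather",
     re.compile(r"what'?s the weather(?: (?P<date>today|tomorrow))?\W*$",
                re.IGNORECASE)),
]


def parse_utterance(utterance):
    """Map a user utterance to an Intent, or None if nothing matches."""
    text = utterance.strip().replace("\u2019", "'")  # normalize curly apostrophe
    for name, pattern in PATTERNS:
        match = pattern.match(text)
        if match:
            # Keep only the variables the user actually supplied.
            variables = {k: v.lower() for k, v in match.groupdict().items() if v}
            return Intent(name, variables)
    return None


def respond(intent):
    """Return a response, or a prompt when more user input is needed."""
    if intent is None:
        return "What can I do for you?"  # open-ended prompt
    if intent.name == "get_weather":
        if "date" not in intent.variables:
            return "For which day would you like the weather?"  # follow-up prompt
        return f"Here is the weather for {intent.variables['date']}."
    return "What can I do for you?"
```

In a Wizard of Oz setting, a human operator would play the role of `parse_utterance` and `respond`; the sketch only makes explicit which pieces of information the operator has to extract from an utterance before answering.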
Utilizing these elements, and mimicking a VUI’s way of processing them, will be necessary when trying and testing Wizard of Oz style prototypes.

3.7 The Cooperative Principle

NLP VUIs aim to function through conversation between the user and the system. To design computers to converse in a natural way, VUI designers must understand the underlying principles of conversation. The semantics of conversation have been carefully studied by H. Paul Grice, who has defined the underlying mechanics of conversation through a set of principles [15]. Together, these principles are known as The Cooperative Principle, which is made up of four sets of subprinciples or maxims [15]. The maxims describe the subconscious cooperation that occurs as a person formulates sentences in a conversation [15]. Grice’s Maxims are as follows [15]:

Quantity
1. Make your contribution as informative as required (for the current purposes of the exchange).
2. Do not make your contribution more informative than is required.

Quality
1. Try to make your contribution one that is true.
   (a) Do not say what you believe to be false.
   (b) Do not say that for which you lack adequate evidence.

Relation
1. Be relevant.

Manner
1. Be perspicuous.
   (a) Avoid obscurity of expression.
   (b) Avoid ambiguity.
   (c) Be brief (avoid unnecessary prolixity).
   (d) Be orderly.

These maxims can be used to formulate the output of a VUI. They can also be applied when designing the VUI to anticipate different user inputs and how the system should respond to them. This applies to designing the dialog of any prototypes developed as a part of this project.

4 Methods

This chapter covers all methodology relevant to the project. Usage details regarding the methods, suitable contexts of use and alternative methods are discussed. The methods are varied, ranging from purely evaluative to creatively stimulating, and can be utilized at different points throughout the project.
4.1 Wicked Problems and Iterative Design

Many of the challenges and problems designers aim to solve are known as wicked problems. Rittel and Webber were the first to define wicked problems, which are problems that are unique, have no definitive formulation, have no stopping rule and whose solutions are not true-or-false but good-or-bad [40]. By comparison, tame problems have a definite formulation and solution, such as math problems, which have stopping rules to indicate when a solution has been reached and equations by which the solution can be verified as true or false. Solutions to wicked problems are rated on a scale of good or bad, where some solutions are better than others and some may be considered a good enough solution to the problem. Thus, as many designers tackle wicked problems, they may use an iterative design process to explore several solutions to find a better or good enough solution.

There are four basic activities in a design process: establishing requirements, designing alternatives, prototyping, and evaluating [36]. Iterative design is the process by which a design is refined by user feedback through the repetition of these four design activities. The iterative design process has been visualized as a design funnel, where at the start of the process, designers begin at the wide end of the funnel and explore a broad number of potential design solutions [7]. As designers progress through the design process, they move towards the narrow end of the design funnel, reducing the number of possible design solutions and ultimately arriving at a design solution [7].

In an iterative design process, each iteration is a step toward narrowing the design funnel. However, each iteration in itself is not narrowing, or reducing [7]. In fact,
each iteration is a combination of divergent and convergent thinking, where the divergence comes from the generation of new ideas and improvements to a design and the convergence is the reduction of those solutions into an iteration or prototype of the design [7]. With respect to wicked problems, each iteration adds knowledge and is an attempt to define and solve the problem.

Figure 4.1: Design funnel as described by Bill Buxton where dashed lines indicate divergence and solid lines indicate convergence in the design process [7]

4.2 Literature Reviews

A literature review is conducted by researching and reviewing research literature relevant to the field of study [27]. The purpose of a literature review is to gather knowledge from previous research or findings to guide new research within a related field [27]. A literature review can vary in its result, from establishing a theoretical framework for discussing previous and future research to practical information, such as guidelines for designing for a specific context. Literature reviews enable researchers and designers to make connections and cross-references between several literature sources in order to understand the larger context behind their own work as well as how their own research can provide new knowledge.

4.3 Summative and Formative Evaluation

Evaluative testing can be divided into two types: formative and summative [33]. A summative evaluation is focused on evaluating the quality of a system or a product [33]. It is typically suitable at the end of a design process, when evaluating a finished system, but also when two alternatives are available or when market competitors are analyzed. Summative evaluations tend to be focused on measuring quantitative data [33].

A formative evaluation is focused on providing input to improve a system or a product [33]. It is typically done in an iterative design process, driving the design forward and motivating design choices and improvements [33].
Formative evaluations are more focused on providing qualitative input [33].

4.4 Field and Lab Testing

There are several possible approaches to testing voice assistants in cars. For this project, the considered options are: in a car simulator, in a real car on a test track, or in a real car on real roads. There are specific pros and cons of each method, but the contextual aspects of sitting in a real car are weighted as being especially important.

Simulations have the great benefit of being a completely controlled environment where the scenario can be completely consistent between tests. A large disadvantage of using simulation is that the participants never feel a sense of real danger as a consequence of their driving. This might lead to the driver adopting a more reckless driving style than usual, affecting the overall outcome of the test [8].

Testing in a real car while driving on actual roads with traffic has the benefit of providing real, contextual information and performance shaping factors, but at the same time the environment is completely uncontrollable. Traffic situations, weather, and red and green lights are all factors that would be completely random. Knowing exactly how these factors affect the results is very difficult.

Conducting tests in a real car on a closed-off, controlled test circuit allows for some of the benefits of both previously mentioned methods. The environment can be better controlled. Real traffic situations can be mimicked and, since the participants are driving real cars, the sense of consequence and danger is there, forcing the driver to always pay close attention to their driving. Weather is still an uncontrollable factor.

4.5 A/B Testing

A/B testing means testing two different versions of a design so that results can be compared and it can be determined which one performs better [27].
The A/B testing method is not qualitative; the two versions, A and B, are only measured by how well they fulfill a certain quantitative criterion. An example could be two versions of a voice assistant where the time to complete a specific task is measured. In this example, the A/B test would only result in knowledge about which one is faster, not why it is faster [27]. To compensate for this lack of qualitative data, it is recommended that A/B testing be combined with other, qualitative methods.

4.6 Interviews

Conducting interviews is a method for design research that allows direct interaction with users and allows researchers to explore the user's personal views, experiences, and perceptions about a subject [27]. Interviews are best done in person so that the researcher may collect information in the form of body language and facial expressions, as well as what is actually said by the user [27].

Interviews can be structured or unstructured. Structured means that all questions are planned in advance, and unstructured means that the questions are made up as the interview progresses [27]. There are combinations where topics and some base questions are formed in advance, but the interviewer is allowed to ask new, unplanned questions if he or she wishes. This is sometimes called a semi-structured interview. Interviewing is a very flexible method and allows customization and tweaking for specific uses. Interviews can be done in groups or individually, and they can be focused on attaining information from specific roles or user groups [27].

4.7 Cognitive Workload Measuring

This section covers details and differences of four different methods that have been developed for the purpose of measuring a subject's cognitive workload.

4.7.1 NASA-Task Load Index

The NASA-Task Load Index (NASA-TLX) is a rating-based measurement method for assessing the subjective experience of workload during activities [17].
The method divides the workload into several specific workload sources, which allows the sources of workload for a specific task to be identified [17]. The method has two steps: first a set of rating scales, then pairwise comparisons. The first step consists of rating all possible sources of workload on a 20-point scale representing 0 to 100 in steps of 5. The different sources of mental workload can be seen in Table 4.1.

Table 4.1: The NASA-TLX measurement factors and their descriptions [17]

| Title | Endpoints | Description |
|---|---|---|
| Mental Demand | Low/High | How much mental and perceptual activity was required (e.g., thinking, deciding, calculating, remembering, looking, searching, etc.)? Was the task easy or demanding, simple or complex, exacting or forgiving? |
| Physical Demand | Low/High | How much physical activity was required (e.g., pushing, pulling, turning, controlling, activating, etc.)? Was the task easy or demanding, slow or brisk, slack or strenuous, restful or laborious? |
| Temporal Demand | Low/High | How much time pressure did you feel due to the rate or pace at which the tasks or task elements occurred? Was the pace slow and leisurely or rapid and frantic? |
| Performance | Low/High | How successful do you think you were in accomplishing the goals of the task set by the experimenter (or yourself)? How satisfied were you with your performance in accomplishing these goals? |
| Effort | Low/High | How hard did you have to work (mentally and physically) to accomplish your level of performance? |
| Frustration Level | Low/High | How insecure, discouraged, irritated, stressed and annoyed versus secure, gratified, content, relaxed and complacent did you feel during the task? |

The second part of the NASA-TLX is a weighting process used to weight the ratings according to how much they influenced the task. The possible pairwise combinations of the sources of workload are compared, and for each pair the user chooses which of the two influenced the task more.
This leads to a weighted rating for each of the sources of workload, and the total weighted task load score is calculated as the average of the weighted scores.

4.7.2 The Driving Activity Load Index

The Driving Activity Load Index (DALI) is a subjective evaluation method for evaluating the cognitive workload of car drivers [35]. The method is largely based on the NASA-TLX but is revised to more carefully evaluate aspects that are specifically relevant to driving, ruling out aspects such as physical demand [35]. A complete list of the measurement factors of the DALI method and their corresponding descriptions can be seen in Table 4.2.

Table 4.2: The DALI measurement factors and their descriptions [35]

| Title | Endpoints | Description |
|---|---|---|
| Effort of Attention | Low/High | To evaluate the attention required by the activity – to think about, to decide, to choose, to look for and so on. |
| Visual Demand | Low/High | To evaluate the visual demand necessary for the activity. |
| Auditory Demand | Low/High | To evaluate the auditory demand necessary for the activity. |
| Temporal Demand | Low/High | To evaluate the specific constraint owing to timing demand when running the activity. |
| Interference | Low/High | To evaluate the possible disturbance when running the driving activity simultaneously with any other supplementary task such as phoning, using systems or radio and so on. |
| Situational Stress | Low/High | To evaluate the level of constraints/stress while conducting the activity such as fatigue, insecure feeling, irritation, discouragement and so on. |

The method is used after a user has performed a task or a set of tasks related to driving. The user rates each of the measurement factors on how big of an impact it had on the task on a two point scale, ranging from very low to very high. The measurement factors are then weighted in relation to each other: the user is shown two factors at a time and chooses the one of the two which had the most impact.
This is repeated until all possible combinations of factors have been shown. The total number of times a certain factor has been chosen as the most impactful is its weight. The original rating that the user filled out is multiplied by its corresponding weight to produce that factor's adjusted score. The sum of all weighted scores divided by 15 represents the weighted rating for the whole task.

4.7.3 Subjective Workload Assessment Technique

The Subjective Workload Assessment Technique (SWAT) is a scaling procedure that allows test participants to put a number on their subjective experience of mental workload during a task [38]. It was originally developed for the U.S. Air Force to assess their pilots' mental workload [38]. The SWAT method measures the workload in three different dimensions: Time Load, Mental Effort Load, and Psychological Stress Load [38]. These three dimensions are combined to give a measure of the total workload of a task.

The SWAT method divides each of the three dimensions into three different levels, where one indicates the lowest level and three indicates the highest. For example, a time load rating of one would indicate a low level of time load, where the user has a lot of time to perform the task, while a rating of three would indicate a very high level of time load, where the user has no spare time and has to deal with overlapping activities.

The first step of the SWAT method is a card sorting process. Cards representing all different combinations of levels for each of the three dimensions are sorted and ordered from the combination that represents the lowest workload to the highest. The lowest would logically be a rating of 1, 1 and 1 for time load, mental effort load and psychological stress load respectively, while the highest would be 3, 3 and 3. The steps in between would typically vary with users and tasks.
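As a hedged illustration of the DALI weighting procedure in Section 4.7.2, the following minimal Python sketch counts how often each factor wins a pairwise comparison, multiplies each rating by that count, and divides the sum by the number of pairs (15 for six factors). The function name `weighted_rating`, the data layout, and the example ratings are assumptions made for illustration only; they are not taken from the DALI protocol or from any tool used in this project.

```python
from itertools import combinations

# The six DALI factors listed in Table 4.2.
DALI_FACTORS = (
    "Effort of Attention",
    "Visual Demand",
    "Auditory Demand",
    "Temporal Demand",
    "Interference",
    "Situational Stress",
)


def weighted_rating(ratings, choices):
    """Return (adjusted scores per factor, overall weighted rating).

    ratings: dict mapping factor name -> rating given by the participant.
    choices: dict mapping each factor pair (a, b) -> the factor chosen
             as more impactful in that pairwise comparison.
    Both structures are hypothetical; the rating range is left to the caller.
    """
    pairs = list(combinations(ratings, 2))  # 15 pairs for six factors
    weights = dict.fromkeys(ratings, 0)
    for pair in pairs:
        winner = choices[pair]
        if winner not in pair:
            raise ValueError(f"{winner!r} is not part of pair {pair!r}")
        weights[winner] += 1  # weight = number of comparisons won
    adjusted = {f: ratings[f] * weights[f] for f in ratings}
    overall = sum(adjusted.values()) / len(pairs)
    return adjusted, overall


# Illustrative input: every factor rated 3, first factor of each pair chosen.
example_ratings = {factor: 3 for factor in DALI_FACTORS}
example_choices = {pair: pair[0] for pair in combinations(DALI_FACTORS, 2)}
adjusted_scores, overall_rating = weighted_rating(example_ratings, example_choices)
print(adjusted_scores, overall_rating)
```

Because the weights always sum to the number of pairs, the overall rating stays within the range of the individual ratings, which makes weighted ratings from different test conditions directly comparable.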
The user would then perform a task and rate it on the three dimensions of workload. By seeing where this rating places in the order of the sorted cards, a weighted workload score ranging from 0 to 100 can be calculated.

4.7.4 Rating Scale Mental Effort

The Rating Scale Mental Effort (RSME) method is a simple, one-dimensional subjective scale method for measuring the mental effort required for a task [49]. It is simpler than many other mental workload measuring methods because it only requires the user to answer one single scale question. The scale is made up of a 15 cm long line with every 1 cm indicated. The line is accompanied by verbal descriptors of the level of mental effort; examples are "almost no effort" and "extreme effort". The position of the verbal descriptors along the scale was carefully adjusted after many user tests during the initial development of the RSME method [49].

In comparison to other mental workload measurement methods, the RSME lacks some of the more complex aspects that make up the total workload. It does not consider different dimensions of mental workload like NASA-TLX, DALI and SWAT do [45].

4.8 System Usability Scale

The System Usability Scale (SUS) is a simple usability scale for subjective assessment of a system's usability [6]. The SUS is made to be quick and to allow users to rapidly convey their experienced usability of a system they have just used.

The SUS is a Likert scale and it utilizes ten 5-point scales ranging from strongly disagree to strongly agree. The SUS contains scales covering topics such as the complexity of the system, the integration of functions, and whether it was cumbersome to use. A full example of the SUS including all scales can be seen in Figure 4.2.

Figure 4.2: A full example of a SUS [6]

The user's inputted values on the scales go through a calculation process where the scores are converted to lower values if they indicate bad usability or higher values if they indicate good usability. This is done by subtracting 1 from the score of every odd-numbered question and subtracting the score of every even-numbered question from 5. After summing the converted values and multiplying the sum by 2.5, a final SUS score between 0 and 100 emerges.

4.9 Subjective Assessment of Speech System Interfaces

The Subjective Assessment of Speech System Interfaces (SASSI) method is a Likert-scale-based questionnaire for subjective evaluation of speech system interfaces [18]. The SASSI consists of 34 different scales related to the user's experience with the speech interface [24]. Each scale is a seven-point Likert scale. The scales are divided into six different topics: System Response Accuracy, Likeability, Cognitive Demand, Annoyance, Habitability and Speed [24]. A full SASSI questionnaire can be seen in Figure 4.3.

4.10 Eye Tracking

Eye tracking is the process of measuring eye movements in relation to different points of fixation in the world. Eye tracking can be used as a measure of visual attention. There are two types of eye tracking: automated and manual.

Automated eye tracking refers to all eye tracking technology that automatically records and translates eye movements into data. Models specifically suitable for in-vehicle eye tracking are remote eye trackers, which do not require the user's head to be locked in position. Makers of popular remote eye trackers include Tobii, EyeTribe, and SMI [32]. While automated eye trackers may benefit from the advantages of technology, such as increased precision, they have their limitations, such as a smaller area of focus. Automated eye trackers may also experience issues with inconsistencies, accuracy, and precision of the collected data, which may require manual review of the collected data [32]. Moreover, these eye trackers may demand consistent lighting conditions and additional configuration, sometimes for each user [32].
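The SUS scoring described in Section 4.8 can be sketched in a few lines of Python. The sketch follows the standard SUS scoring rule (odd-numbered items contribute the response minus 1, even-numbered items contribute 5 minus the response, and the sum is multiplied by 2.5). The function name `sus_score` and the input format, a list of ten responses from 1 to 5 in questionnaire order, are assumptions chosen for illustration.

```python
def sus_score(responses):
    """Compute a SUS score (0 to 100) from ten responses on a 1-5 scale.

    responses: list of ten integers in questionnaire order, where
    1 = strongly disagree and 5 = strongly agree (assumed input format).
    """
    if len(responses) != 10:
        raise ValueError("SUS requires exactly ten responses")
    total = 0
    for number, response in enumerate(responses, start=1):
        if not 1 <= response <= 5:
            raise ValueError("each response must be between 1 and 5")
        if number % 2 == 1:
            total += response - 1  # odd items are positively worded
        else:
            total += 5 - response  # even items are negatively worded
    return total * 2.5


# Illustrative input: a neutral answer (3) on every item.
print(sus_score([3] * 10))
```

Each converted item lies between 0 and 4, so the sum of ten items lies between 0 and 40, and the factor 2.5 maps it onto the 0 to 100 range.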
Manual eye tracking refers to tracking and measuring user visual attention through the visual analysis of video recordings. This is done by having a researcher manually code or annotate segments of a video recording according to relevant glance areas, usually with the aid of some annotation software. With respect to driver attention, relevant glance areas may include parts of the road or areas of the vehicle's interior. The precision of this method is lower compared to automated eye tracking, but it allows for examining larger areas of glance interest. Moreover, manual eye tracking does not require advanced camera equipment or configuration. Cameras used in manual eye tracking should have sufficient video quality to see the user's eyes under all expected light conditions. Depending on the desired time precision of the eye tracking analysis, different frequencies of video capture may be considered.

Automated and manual eye tracking each have their advantages and disadvantages. Automated eye tracking is most suitable for situations where the glance area is relatively small and requires high precision, for example when examining areas of interest in the driver information module (DIM), the area behind the steering wheel. For larger areas of glance interest, such as multiple areas within a vehicle, manual eye tracking may be more suitable. Manual eye tracking also requires less setup, but requires additional labor to manually code eye glances in the video footage.

4.11 Affinity Diagramming

Affinity diagramming is a method used for analyzing and structuring results from research [27]. The results are structured so that themes emerge, allowing designers to better understand and categorize data. This ultimately leads to a good understanding of major problems or other important details [27].

The method is conducted by first letting all participants write down all relevant details gathered through research on notes.
Each participant may have their own unique color of notes to make them easier to distinguish. The notes are all put on a wall, and the participants can then start moving them, trying to group them into relevant groups and come up with group titles, and even subgroup titles if they feel the need.

A popular method for making affinity diagrams is the KJ method [27]. The KJ method follows the procedure described above, but with a strong emphasis on talking not being allowed while writing, placing, and organizing the sticky notes. Working in silence minimizes any possible influence of group pressure on the participants [27].

4.12 Wizard of Oz

The Wizard of Oz (WOz) technique is performed by simulating a working prototype or system by letting a researcher, or "wizard", operate and control the prototype from behind the scenes [27]. Developing a fully working prototype is time and resource intensive. The WOz technique allows researchers and designers to evaluate a design concept without having to spend as many resources as building a fully functional prototype would demand [27].

From the user's or test participant's perspective, WOz prototypes and implemented features are indistinguishable. This is achieved by preparing system responses for potential paths of interaction in advance, so that the prototype operator, or wizard, can quickly respond to user input. For WOz prototypes to be successful, the prototype operator must be able to see or hear the user so that appropriate responses can be provided based on user input. Moreover, users should be unaware that the prototype operator is controlling the WOz prototype.

The WOz technique has a long history with the development of speech recognition and voice user interfaces [16]. WOz prototypes can be used throughout the design process of voice interfaces and are an invaluable resource for understanding users' vocabulary, utterance structures, and interactive patterns [16].
While WOz prototypes allow voice interface designers to bypass developing speech recognition systems to evaluate a design, the value of designed errors is not to be discounted. In fact, there are tools for creating WOz prototypes that randomly assign speech recognition errors to understand user reactions to such scenarios [23].

Figure 4.3: The SASSI questionnaire [24]

Figure 4.4: Affinity diagram (partial) used to analyze qualitative data

5 Process

This chapter describes the research and design process carried out as part of this thesis work. Several methods were carried out throughout the process, and the specifics of those methods with respect to the purpose of this project are discussed here. For details about the methods themselves, see Chapter 4: Methods.

5.1 Pre-study and Preparation

The pre-study phase was the first phase of the research project and focused on a review of related research. This pre-study was done to understand what research has already been done and what gaps in the research exist which this project could aim to answer.

In addition to developing a contextual understanding of the research area, the pre-study helped to identify methods and test setups that are frequently used when examining distracted driving in terms of visual distraction and cognitive load. Methods for data analysis and theories related to attention and cognition were also identified. The knowledge gathered during the pre-study phase was used to plan the project execution.

5.2 Project Planning

The project planning phase was focused on developing a schedule for the execution of the project. The distribution of time between the pre-study and preparation, project execution, and project finalization phases was based on recommendations from thesis examiners at Chalmers University of Technology and spread over a 20-week time period.
During the project planning phase, methods were selected for their suitability to the research question developed in the pre-study phase. A full Gantt schedule of the project process with calendar week numbers can be seen in Appendix A.

5.3 Literature Review of Existing Guidelines

During the pre-study phase, a brief review of three existing design guidelines was done to gain a general understanding of what each set of guidelines covered with respect to voice assistant interaction in vehicles. The guidelines reviewed in the pre-study stage were Android Auto Design Guidelines, Apple CarPlay Human Interface Guidelines, and Google Conversation Design [2, 1, 14, 13]. These guidelines were selected for review since they are directly tied to the two commercially available integration interfaces.

To answer the first sub-question of the research question and understand what current guidelines exist for voice assistant interaction in vehicles, a more in-depth literature review of the existing guidelines was required. This literature review aimed to summarize and understand the collective wisdom of the industry when it comes to in-vehicle voice assistant interaction. In addition to the guidelines reviewed during the pre-study phase, this literature review also included the Amazon Alexa Design Guide [1]. Although Amazon does not have an integration interface on the market, it has announced plans to release one in the coming years. A review of existing NHTSA guidelines was also done, as those guidelines specifically deal with traffic safety [30, 31].

The results of this literature review would be used in later phases to identify established guidelines which work well to decrease visual distraction and cognitive load. The review was also used to identify areas where the guidelines were not followed by existing voice assistants and to identify gaps in the guidelines with respect to distracted driving and voice assistant interaction.
5.4 Summative Evaluation

In order to understand the efficacy of the existing guidelines, a summative evaluation of existing voice assistants in vehicles was conducted. The summative evaluation also served to identify any issues in the voice assistant integrated IVI that may contribute to visual distraction and increased cognitive load, thereby decreasing safe driving. The evaluation consisted of three parts: an on-road test, data collection and handling, and analysis of data collected from the test.

A total of 8 test participants completed the on-road test. A ninth participant began an on-road test, but the test was ended prematurely due to concerns for traffic safety. The 8 test participants had been licensed drivers for a mean time of 12.6 years. Frequency of driving was evenly spread out among participants, from driving every day to less than once a month. Only two participants had previous experience with driving with PA, the Level 2 ADS used in the test. All but one participant had previous experience with VAs, and the large majority of this previous experience was with VAs on smartphones, a screen-first solution.

5.4.1 On-road Test Setup

The on-road test was done on public roads in Torslanda, Gothenburg. Test participants drove along a predefined route that measured 10.3 kilometers, with roundabouts at each end which made for a continuous driving experience. Speed limits along the route varied between 50 and 70 kilometers per hour.

Figure 5.1: Interior of the test car model, equipped with Android Auto and Apple CarPlay

Test participants were recruited internally at Volvo but were not limited to employees. Students and consultants placed at Volvo were also invited. Due to liability issues with the test car, only employees, students, and consultants with Volvo access could participate.
The test car used was a Volvo V90 with automatic transmission and Pilot Assist (PA), a lane keeping and adaptive cruise control feature which makes it an SAE Level 2 autonomous vehicle. The implementation of the PA feature on the test car is common to other Level 2 vehicles. The V90 test car was also equipped with both Android Auto and Apple CarPlay.

The on-road test was designed to compare the performance of the two voice assistants, Apple Siri and Google Assistant. The test was also designed to determine if there was a decrease in visual distraction and cognitive load when test participants were aided by Pilot Assist. Thus, there were four conditions for test participants to complete:

• Android Auto with manual driving
• Android Auto with Pilot Assist
• Apple CarPlay with manual driving
• Apple CarPlay with Pilot Assist

Under each condition, test participants were asked to perform 9 secondary tasks while driving. These tasks were selected due to their relation to the categories of apps enabled on integration interfaces. Moreover, functionality for all tasks exists on both voice assistants tested. The tasks were:

1. Open a new received text message
2. Send text message to a contact
3. Make a call to a contact
4. Make a call to a contact with multiple phone numbers
5. Play a genre of music
6. Play a specific song by a specific artist
7. Start navigation to a street address
8. Add a café to the current route
9. Start navigation to the nearest McDonald's

For each test, participants began by signing a consent form for their data to be collected and used for this project's research. Next, they completed a survey about their previous experience with driving and using voice assistants. On the drive from Volvo Headquarters to the designated test route, test participants were trained on using PA and had a chance to get familiar with driving the car, with and without PA.
Then, the test participants performed the 9 secondary tasks while driving along the test route for each of the four test conditions. The order of test conditions was randomized to minimize any bias from the order in which the test conditions were completed. Prior to starting each test condition, test participants were given training on the voice assistant for each condition in relation to the types of tasks they would be asked to perform. The order of the 9 tasks was not randomized, since some tasks built upon the output of a previous task and the overall difference in voice assistants was the focus of each test condition. After completing each driving condition, test participants were asked to complete a DALI survey to assess the cognitive load of each test condition. After the test, participants were asked a set of follow-up questions about their overall experience in a semi-structured interview.

For each task, participants were permitted up to 3 attempts in the case of task failure. Task failure is defined as the end of an interaction with the VA that does not trigger the desired intent or action. Task success is defined as the successful completion of a task using the voice assistant. For example, an utterance for Task 7 which results in navigation to the wrong address would be considered a task failure. Test participants were not required to make repeated attempts in the case of task failure.

The survey about the participant's previous experience can be found in Appendix B. A complete protocol of the on-road test can be found in Appendix C. The test schedule and the randomized condition permutations can be found in Appendix D.

5.4.2 Data Collection and Handling

Visual distraction during the on-road test was measured by manual eye tracking, which is further described in Section 4.10. The primary focus of this data was to distinguish between on-road and off-road glances.
Moreover, cognitive load was assessed using DALI surveys completed during the on-road test. The DALI survey used can be found in Appendix I.

Eye glance data was collected during the on-road test via three video cameras mounted throughout the car. The cameras recorded a view of the driver's face, the IVI display, and the road. The three views of the cameras can be seen in Figure 5.2. Audio was included in the video recordings. These videos were then synchronized for each condition. The synchronized videos made it possible to code the eye glances of the test participant and understand the context of glances with the added road and IVI views.

The synchronized video was then manually coded through a custom tool created in Matlab, one task at a time. The software used is seen in Figure 5.2. Eye glance analysis for a task began at the end of the test facilitator's prompt to complete the task and continued until task success or the end of the last attempt to complete the task. Thus, the task footage analyzed may include more than one attempt to complete a task. The tool allowed the video to be analyzed at a 30 Hz frequency and made it possible to assign a glance code to each frame analyzed. Once the glance codes were assigned, the duration of each glance was calculated in preparation for data analysis.

Figure 5.2: Eye tracking software and video with three camera views

Several codes were used to annotate the eye glance data. These codes were:

0. On-road: Glances on the road
1. IVI VA Inactive: Glances at the IVI when the VA is not active
2. IVI VA Active: Glances at the IVI when the VA is actively awaiting a driver utterance
3. IVI VA Processing: Glances at the IVI when the VA is processing an utterance
4. IVI VA Response: Glances at the IVI when the VA is presenting a response or prompt
5. DIM PA Off: Glances at the driver information module (DIM) when PA is off
6. DIM PA On: Glances at the DIM when PA is on
7. Miscellaneous: Glances that are not directed at the road, IVI, or DIM

DALI data was collected for each test participant and each test condition, for a total of 4 completed DALIs per test participant. Adjusted ratings from each DALI were calculated for the individual dimensions of the DALI. Combined, the adjusted ratings resulted in a weighted rating also used in later data analysis. DALI scores were weighted according to the established protocol described in Chapter 4.

In addition to the quantitative data collected above, qualitative observational data was also collected. The synchronized videos were reviewed and qualitative observations, such as emotional reactions and scenarios of high frustration, were recorded. Test participant answers from the debrief interview were also transcribed for qualitative analysis.

5.4.3 Data Analysis

The data analysis was done in two parts: qualitative and quantitative. The qualitative analysis deals with observational notes of the on-road test and transcribed answers from the debrief interview. The quantitative analysis concerns the eye glance and DALI data.

Qualitative data from observation notes and interview transcriptions were combined and analyzed using the affinity diagramming method. This allowed connections between different data points and recurring themes in the data to be identified. The insights from this analysis would later guide the development of new design guidelines and actualizations of these guidelines as prototypes.

Figure 5.3: Affinity diagram (partial) of qualitative summative evaluation data

The quantitative data was analyzed using Minitab, a statistical data analysis software. Eye glance and DALI data was plotted in order to identify any trends in the data. The plots were also used to help determine whether the different VAs had a discernible effect on visual distraction and cognitive load.
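The conversion from per-frame glance codes to glance durations described in Section 5.4.2 can be illustrated with a short Python sketch. It is not the Matlab tool used in this project; it only assumes that each analyzed frame carries one glance code (0 for on-road, other values for off-road areas) and that frames are spaced at 30 Hz. The function names `glance_durations` and `off_road_time` and the example frame sequence are hypothetical.

```python
from itertools import groupby

FRAME_RATE_HZ = 30  # analysis frequency stated in Section 5.4.2
ON_ROAD_CODE = 0    # glance code for on-road glances


def glance_durations(frame_codes, frame_rate_hz=FRAME_RATE_HZ):
    """Collapse consecutive frames with the same code into glances.

    Returns a list of (glance code, duration in seconds) tuples in
    chronological order.
    """
    glances = []
    for code, run in groupby(frame_codes):
        frame_count = sum(1 for _ in run)
        glances.append((code, frame_count / frame_rate_hz))
    return glances


def off_road_time(glances, on_road_code=ON_ROAD_CODE):
    """Sum the duration of all glances that are not on the road."""
    return sum(duration for code, duration in glances if code != on_road_code)


# Illustrative sequence: 1 s on-road, 0.5 s at the active VA, 2 s on-road.
frames = [0] * 30 + [2] * 15 + [0] * 60
glances = glance_durations(frames)
print(glances, off_road_time(glances))
```

Grouping consecutive frames keeps individual glances separate, so measures such as the number of glances or the longest single off-road glance can be derived from the same list.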
The plots were also used to determine if the use of SAE Level 2 ADS, in this case PA, also had an effect on visual distraction or cognitive load. The results from the quantitative data analysis were also used to support and motivate new design guidelines for voice assistant interaction in vehicles.

5.5 Prototype Development and Evaluation

Following the summative evaluation, ideation began for improvements to address the issues identified in the summative evaluation. The ideation process resulted in two prototypes, Prototype 1 and Prototype 2. These prototypes embodied interstitial, new guidelines for voice assistant interaction in vehicles. These prototypes were then tested in a simulator. The results from the simulator test were then collected and analyzed.

Both prototypes were tested by 10 participants, but eye glance data is only available for 9 participants due to file corruption. The participants had been licensed drivers for a mean time of 9.1 years. Participants were mostly infrequent drivers, driving every other week or less. Three test participants drove at least once a week. All but two participants had previous experience with VAs. The majority of participants had previously experienced VAs on smartphones.

5.5.1 Ideation and Prototype Development

The aim of the ideation process was to come up with potential solutions to improve the problems in the existing guidelines and voice assistants. The ideation process began by narrowing the number of tasks that would be performed by the test participant during the simulation test. Tasks from the on-road test were reused, but tasks which involved little interaction and yielded little data, such as the task of calling a contact by name, were removed.

A voice-first approach was taken: the ideation process focused first on generating many conversation dialogs for the test tasks. Eventually, two concepts emerged, which will be labeled as Prototype 1 and Prototype 2.
Conversation dialogs for both prototypes were further refined to reflect the two concepts. This includes having multi-turn and one-shot dialogs for applicable tasks. Moreover, since errors were found to be a factor in visual distraction and cognitive load in the summative evaluation, error handling was a key redesigned element in both prototypes.

Both prototypes were then implemented in Adobe XD, as shown in Figure 5.4, which allows designers to assign a pre-written system response for each screen. Adobe XD is able to output both visual information and speech. The prototypes were developed with high-fidelity graphics for the purposes of using the Wizard of Oz (WOz) technique when testing the prototypes. The two prototypes shared a visual language, in order to focus on discerning any differences or preferences between the two voice interaction concepts developed.

In order to later use the WOz technique with the prototypes, a control panel was designed for both prototypes. The control panel allows the wizard to control the flow of system responses to test participant input. During the simulator test, the control panel would be hidden from the test participant.

Figure 5.4: Prototype development in Adobe XD

5.5.2 Simulator Test Setup

The two prototypes were tested in a truck simulator located at Chalmers Johanneberg campus. Test participants drove along a highway, following in-game navigation directions in the truck-driving simulation game Euro Truck Simulator 2. The game included multiple lanes of traffic, which participants were free to switch between while avoiding collision with any of the other in-game vehicles.

Test participants were recruited through an online survey. Participants were required to have a valid driver's license to participate. Upon completing the test, participants were compensated with a gift card for 250 kronor.

The simulator setup included a large TV display positioned in front of the driver's seat.
The seat was a fully adjustable car seat, which helped experienced drivers acclimate to the simulator. The simulator was also equipped with steering wheel, gear shift, and pedal game controls to drive the truck. The setup can be seen in Figure 5.5. The steering wheel was equipped with some force feedback to simulate bumps in the road, and the simulator was set to an automatic transmission. The simulator was not equipped with any autonomous driver features.

A Windows Surface Book was mounted to the right of the steering wheel to simulate the IVI. The simulated IVI displayed the two tested prototypes, one at a time, and was connected to the wizard computer by remote desktop. This allowed the wizard to control the prototype's responses to test participant input on the fly. The wizard was situated in the same room as the test participant and test facilitator, but behind a partition so that participants were not aware that the wizard was controlling the IVI prototype. Only one person acted as the wizard to minimize systematic bias.

Figure 5.5: Simulator test setup with a dividing wall between the test participant and wizard (not to scale)

As shown in Figure 5.5, the control panel of the IVI prototype was hidden from the test participant but visible to the wizard. This control panel on the prototypes allowed the wizard to remotely control the prototypes in real time, in dir