Designing in-vehicle voice assistants
Creating safer, integrated driver experiences

Master’s thesis in Computer science and engineering

CONNIE (KHANH) NGUYEN AND WILLIAM FALKENGREN

Department of Computer Science and Engineering
CHALMERS UNIVERSITY OF TECHNOLOGY
UNIVERSITY OF GOTHENBURG
Gothenburg, Sweden 2019


© CONNIE (KHANH) NGUYEN AND WILLIAM FALKENGREN, 2019.

Supervisor: Fang Chen, Department of Computer Science and Engineering
Advisor: Jenny Wilkie, Volvo Cars
Examiners: Staffan Björk & Olof Torgersson, Department of Computer Science and
Engineering

Master’s Thesis 2019
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Cover: In-vehicle infotainment system with prototype voice assistant interface.

Typeset in LaTeX
Gothenburg, Sweden 2019




Abstract
Voice assistants are increasing in popularity with the rise of devices like smart speakers
and screens. As people grow accustomed to using these assistants, it is likely they
will want to use the same voice assistant in their car. Many modern cars already
support integration of voice assistants from both Apple and Google. In this project,
voice assistants integrated into the vehicle and their effects on safety in terms of
increased diverted attention and cognitive load are examined. Current voice assistants
are also reviewed. Apple Siri and Google Assistant, two commercial voice
assistants, are evaluated under the conditions of manual driving, as well as with
longitudinal and lateral assistive drive features. New, improved design solutions
and guidelines were evaluated through two prototypes with different approaches to
solving the problems found in existing voice assistants. The results indicate several
similarities and differences in the existing design guidelines for the different voice
assistants. Users provide input and thoughts about the existing solutions. New design
solutions for decreasing distraction and cognitive load are presented. These new
solutions can support continued research and further improvement of voice assistants
within cars in the future.

Keywords: Voice Assistant, Voice Interaction, Driving, Safety, Attention, Cognitive
Load, Design Guidelines.



Acknowledgements
We would like to thank all personnel of Volvo Cars who in any way participated
and helped with the project. We would like to especially thank Jenny Wilkie for her
expertise and all her supervision, feedback and guidance throughout the project. We
would like to thank all test participants who participated in our studies. We would
like to thank Chalmers for providing us with material and equipment and lastly, we
would like to thank Fang Chen for her supervision throughout the project.

Connie (Khanh) Nguyen & William Falkengren, Gothenburg, June 2019



Abbreviations
NLP - Natural Language Processing
IVI - In-Vehicle Infotainment
HMI - Human-Machine Interface
VUI - Voice User Interface
ADS - Autonomous Driving System
NHTSA - National Highway Traffic Safety Administration
SAE - Society of Automotive Engineers
PA - Pilot Assist
VA - Voice Assistant
CSD - Center-Stack Display
DIM - Driver Information Module
AA - Android Auto
AC - Apple CarPlay



Contents

List of Figures

List of Tables

1 Introduction
1.1 Purpose
1.2 Aim
1.2.1 Scope
1.2.2 Stakeholders
1.2.3 Ethical Concerns
1.3 Research Questions

2 Background
2.1 Voice Interaction
2.1.1 Voice Assistants
2.2 Designing with Voice
2.3 Autonomous Cars
2.4 Distracted Driving
2.5 Guidelines for In-vehicle and Voice Interfaces
2.5.1 NHTSA Interface Guidelines
2.5.2 Android Auto
2.5.3 Apple CarPlay
2.5.4 Google Assistant
2.5.5 Siri
2.6 Related Research

3 Theory
3.1 Research Approach
3.2 Wickens’ Attention Model
3.3 Intensive and Selective Attention
3.4 Cognitive Load
3.5 Eye Movement
3.6 Elements of Voice Interfaces
3.7 The Cooperative Principle

4 Methods
4.1 Wicked Problems and Iterative Design
4.2 Literature Reviews
4.3 Summative and Formative Evaluation
4.4 Field and Lab Testing
4.5 A/B Testing
4.6 Interviews
4.7 Cognitive Workload Measuring
4.7.1 NASA-Task Load Index
4.7.2 The Driving Activity Load Index
4.7.3 Subjective Workload Assessment Technique
4.7.4 Rating Scale Mental Effort
4.8 System Usability Scale
4.9 Subjective Assessment of Speech System Interfaces
4.10 Eye Tracking
4.11 Affinity Diagramming
4.12 Wizard of Oz

5 Process
5.1 Pre-study and Preparation
5.2 Project Planning
5.3 Literature Review of Existing Guidelines
5.4 Summative Evaluation
5.4.1 On-road Test Setup
5.4.2 Data Collection and Handling
5.4.3 Data Analysis
5.5 Prototype Development and Evaluation
5.5.1 Ideation and Prototype Development
5.5.2 Simulator Test Setup
5.5.3 Data Collection and Handling
5.5.4 Data Analysis
5.6 New Guidelines

6 Result
6.1 Literature Review of Existing Guidelines
6.1.1 Designing Car Apps
6.1.2 Voice and Manual Input
6.1.3 General Voice Responses
6.1.4 Situation Awareness
6.1.5 Presenting Choice
6.1.6 Error Handling
6.1.7 Discoverability
6.1.8 Display
6.1.9 Notifications
6.2 Summative Evaluation
6.2.1 Qualitative Results
6.2.2 Quantitative Results
6.3 Prototype Development and Evaluation
6.3.1 Prototypes Developed
6.3.2 Qualitative Results
6.3.3 Quantitative Results
6.4 New Guidelines
6.4.1 Voice and Manual Input
6.4.2 General Voice Responses
6.4.3 Error Handling
6.4.4 Situation Awareness
6.4.5 Presenting Choice
6.4.6 Display
6.4.7 Discoverability
6.4.8 Notifications

7 Discussion
7.1 Findings
7.2 Process
7.2.1 Methods
7.2.2 Tests
7.3 New Guidelines
7.4 Future Work

8 Conclusion

Bibliography

A Project Plan GANTT Chart
B Summative Evaluation Survey
C Summative Evaluation Test Protocol
D Summative Evaluation Test Schedule
E Prototype Evaluation Test Protocol
F Prototype Evaluation Survey
G Interaction Paths of VA Prototypes
H Prototype Evaluation Schedule
I DALI Survey
J SUS Survey
K Summative Evaluation KJ Results
L Prototype Evaluation KJ Results
M Summarized Existing Guidelines


List of Figures

2.1 The Android Auto GUI
2.2 The CarPlay GUI
2.3 Google Assistant displaying results for nearby restaurants on an Android phone

3.1 Wickens’ Multiple Resource Model [48]

4.1 Design funnel as described by Bill Buxton where dashed lines indicate
divergence and solid lines indicate convergence in the design process [7]
4.2 A full example of a SUS [6]
4.3 The SASSI questionnaire [24]
4.4 Affinity diagram (partial) used to analyze qualitative data

5.1 Interior of the test car model, equipped with Android Auto and Apple CarPlay
5.2 Eye tracking software and video with three camera views
5.3 Affinity diagram (partial) of qualitative summative evaluation data
5.4 Prototype development in Adobe XD
5.5 Simulator test setup with a dividing wall between the test participant
and wizard (not to scale)
5.6 Video used for eye tracking
5.7 Affinity diagram (partial) of qualitative prototype evaluation data

6.1 Frequency of off-road glances
6.2 Intervals of DIM and IVI glances during four conditions
6.3 The count of DIM and IVI glances during the four conditions
6.4 Count of off-road glances by direction and condition
6.5 Count of off-road glances during the various test tasks
6.6 Count of off-road glances during tasks with error indications
6.7 DALI weighted rating of Android Auto and Apple CarPlay
6.8 Adjusted ratings of the individual DALI dimensions
6.9 Weighted ratings of manual drive and pilot assist
6.10 Prototype 1, left, and Prototype 2, right, and their differences when
sending a text message
6.11 Prototype 1 and Prototype 2 with their differences in voice interaction
for showing results in a list
6.12 Frequency of off-road glance duration times
6.13 Count of off-road glances by task
6.14 Count of off-road glances during tasks with error indications
6.15 Count of off-road glances by task and prototype
6.16 Weighted DALI ratings for Prototype 1 and Prototype 2
6.17 Adjusted rating of the dimensions of the DALI
6.18 SUS score comparison with adjective ratings and acceptability ranges [5]


List of Tables

2.1 SAE levels of driving automation [41]

4.1 The NASA-TLX measurement factors and their descriptions [17]
4.2 The DALI measurement factors and their descriptions [35]

6.1 Prototype similarities and differences
6.2 SUS Results


1 Introduction

Voice assistants have exploded in popularity in recent years thanks to smart speak-
ers. Natural Language Processing (NLP) allows these smart speakers to communi-
cate with their users in a convenient and natural way and makes them suitable for
helping their users with a large and varied set of tasks. One report predicts that
47% of American homes will have a smart speaker by 2022 [34]. As people grow
accustomed to the voice assistants in their homes and on their phones, it is not un-
reasonable to assume that drivers will use the same voice assistant from their daily
lives in their cars. This possibility is in fact a reality in many modern cars that of-
fer voice assistant integration directly into the in-vehicle infotainment system (IVI)
through Android Auto or Apple CarPlay. Voice assistants, and voice interaction at
large, offer drivers an eyes-free, hands-free way to complete secondary tasks while
driving. However, voice assistants are susceptible to recognition issues and have
transient, paced interaction flows that require immediate response from the driver.
Despite the integration of voice assistants into vehicles, set guidelines for safe voice
interaction are not well defined. As voice assistant integrated IVIs become increas-
ingly prevalent, it is necessary to evaluate the existing commercially available voice
assistant integrated IVIs in relation to distracted driving.

1.1 Purpose

Voice interaction in vehicles has long been a topic of research [25]. However, many
of the previous studies on voice interaction in vehicles focus on evaluating voice
interactions developed by the OEM, such as the Chevrolet MyLink and the Volvo
Sensus [28, 39]. Generally, voice interaction has been found to be a safer alternative
to standard HMI inputs to the IVI [25]. However, the landscape of voice interaction
in vehicles is expanding with voice assistants developed by software giants including
Google and Apple. These voice assistants have seldom been examined in relation
to distracted driving and have yet to be studied in a setup where they are fully
integrated in an IVI. A deeper understanding of voice assistants’ effect on distracted
driving is critical as voice assistants increasingly integrate into available IVIs.


Despite the heralded safety benefits of voice interaction, standards for safe voice
interaction in vehicles are largely undefined. The National Highway Traffic Safety
Administration (NHTSA), which has set guidelines and standards for safe manual
interactions with IVIs, has yet to publish similar guidelines for voice interaction [30,
31]. With many voice interaction systems already commercially available, designers
cannot continue to put off considerations for safe voice interactions while driving.
Voice assistants in particular introduce the possibility for third-party designers and
developers to create and distribute in-vehicle applications, such as navigation apps.
Today’s guidelines treat the design of the IVI and voice assistant as two separate
entities rather than one integrated voice-driven, multimodal experience [2, 14, 13].

An added layer of consideration is the development of autonomous vehicles in parallel
with voice assistants. Society of Automotive Engineers (SAE) Level 2 autonomous
driving offers drivers support in the primary task of driving through features such
as adaptive cruise control and lane keeping [41]. However, a misunderstanding of
how these systems work and their limitations may lead to overly trusting or relying
on these support systems, thereby causing drivers to divert their attention from the
primary task to a secondary one. The effect of using voice assistants to complete
secondary tasks in combination with driver support systems should also be considered
for safer interaction.

1.2 Aim

The primary aim of this project is to improve voice assistant interactions to com-
plete secondary tasks without compromising driver safety. Thus, it is necessary to
assess the current state of the art of voice assistant integrated IVIs commercially
available today and the design patterns they employ. These systems will be assessed
in relation to their effect on distracted driving, which includes visual distraction and
cognitive load. While voice interaction can reduce visual distraction, it has some
possible drawbacks. It is transient, meaning it is non-persistent, and it can poten-
tially increase the cognitive load of completing secondary tasks, such as mentally
visualizing navigation instructions in an unfamiliar area. Both visual distraction
and cognitive load must be carefully balanced to create what may be considered a
safe interaction while driving. This project also aims to produce design guidelines
for multimodal voice assistant-driven interactions for performing secondary tasks.

1.2.1 Scope

This project is limited to the performance of secondary tasks using a voice assistant
in a vehicle equipped with a voice assistant integrated IVI. Guidelines for currently
existing integrated voice assistants will be evaluated and a new set of guidelines
will be suggested. The new suggested guidelines will consist of currently existing
guidelines as well as new guidelines developed in this project. The project is further
delimited to situations with a driver using a Level 2 or lower autonomous passenger
vehicle with no additional passengers. The drivers are defined as civilian drivers,
people who may drive as part of their daily commute, but not those who drive for
extended periods of time as part of their profession, such as a taxi cab driver or a
cargo trucker. As such, the design guidelines that will be produced as a result of
this project may be limited to driving scenarios that also match the scope of the
project. As this project considers the current state of voice assistants in vehicles,
the produced guidelines would be directly applicable for the near future.

As autonomous driving improves, reaching high or full levels of automation, the types
of tasks users will perform in the vehicle will likely shift to be more entertainment-focused.
However, even with the advent of fully autonomous vehicles, complete
market adoption of such vehicles will not happen overnight. Thus the guidelines
produced by this project will remain relevant for the remaining vehicles on the road
that are Level 2 and under.

1.2.2 Stakeholders

This project is carried out as part of a larger project known as SEER (Seamless,
Efficient and Enjoyable user-vehicle inteRaction). SEER is a collaboration
between Volvo Cars, Volvo Technology, RISE Viktoria, and Semcon; the project is
funded by Vinnova [44]. The SEER project is focused on improving the experience
of completing secondary tasks in low-level autonomous vehicles (up to SAE level 2).
General findings and projects developed under the umbrella of SEER are available to
the public to promote knowledge sharing and innovation in the automotive industry.

1.2.3 Ethical Concerns

This project will use on-road tests to assess the current state of voice assistant
integrated IVIs. In such testing conditions, test participant safety is paramount
and shall take precedence over the test itself. Additional measures, such as using a
specially equipped vehicle for facilitator intervention, may be necessary in the interest
of safety.

An additional concern is the handling of personal user data. The commercially
available voice assistants send and retrieve data to and from external servers owned
by parties outside of this project, such as the voice assistant author company and
third-party app services. A further concern regarding personal user data is the collection
of video footage. Video footage for this project was collected with full
consent from the test participants, who also had the option to have
any personally identifying footage removed once the collected data was analyzed.


1.3 Research Questions

In the context of SAE Level 2 (and lower) vehicles, this project addresses the fol-
lowing questions:

1. What adjustments to existing NLP-based voice assistant design guidelines
should be made for safer interaction while driving?

(a) What existing design guidelines and patterns are implemented in voice
assistant integrated infotainment systems?

(b) What improvements to existing voice assistants can be made to minimize
diverted attention from the primary task of driving?

(c) What improvements to existing voice assistants can be made to minimize
cognitive load while executing a secondary task during the primary task
of driving?

In the primary research question, safer interactions are defined with respect to how
well they comply with NHTSA design guidelines for human-machine interfaces (HMIs)
and reduce distracted driving [30, 31]. While the NHTSA guidelines explicitly do
not consider voice interaction, they may serve as a starting point as the infotain-
ment systems examined are multimodal. Moreover, NHTSA has yet to define safe
interactions with respect to voice interaction, though it has plans to do so in the
future. The current NHTSA guidelines related to this project are covered in sections
2.4 and 2.5.1.



2 Background

This project delves into several areas including voice interaction, autonomous cars,
and distracted driving. This chapter provides a brief history of each area and pre-
vious scientific work researching the intersection of all three.

2.1 Voice Interaction

The field of voice interaction has experienced a recent increase in interest thanks
to the introduction of smart speakers to the market. However, voice interaction long
predates smart speakers: early interactive voice systems were introduced in
the 1990s [22]. These early systems were known as finite state voice user interfaces
(VUIs). VUIs are typically categorized as either finite state or natural language
processing (NLP), but hybrids do exist [22].

Finite state VUIs are characterized by a limited set of commands for each point
in the interaction flow, typically in a tree menu [22]. Most people encounter finite
state VUIs on the phone, in the form of automated customer service systems. These
systems are usually met with frustration as many users have difficulty finding the
information or action they want in a tree menu.
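
As an illustration only, the tree menu behind a finite state VUI can be sketched as a lookup table in which each state accepts a small, fixed set of spoken commands. The menu contents and the names `MENU`, `step`, and `run` are hypothetical and are not taken from any system examined in this project.

```python
# Hypothetical tree menu for an automated customer service line.
# Each state maps the only commands it accepts to the next state.
MENU = {
    "main": {"billing": "billing", "support": "support"},
    "billing": {"balance": "balance", "back": "main"},
    "support": {"agent": "agent", "back": "main"},
    "balance": {},  # leaf: the balance would be read out here
    "agent": {},    # leaf: the call would be transferred here
}


def step(state, utterance):
    """Return the next state, or stay put if the utterance is not an accepted command."""
    return MENU[state].get(utterance.strip().lower(), state)


def run(utterances, state="main"):
    """Feed a sequence of utterances through the menu and return the final state."""
    for utterance in utterances:
        state = step(state, utterance)
    return state
```

In this sketch a conversational request such as "what is my balance" is rejected at the main menu because it matches neither accepted command, which mirrors the difficulty users have in finding the action they want.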

Natural language processing VUIs improve upon their finite state predecessors by
recognizing a wider array of user input for the same action through statistical lan-
guage modeling. Prime examples of NLP VUIs are the voice assistants available on
smartphones and smart speakers. Voice assistants typically process voice input off-site
through cloud computing. The most popular voice assistants include Amazon
Alexa, Google Assistant, Apple Siri, and Microsoft Cortana. These voice assistants
allow users to interact with them in a more natural, conversational manner. More-
over, thanks to their off-site processing, voice assistants can learn and improve over
time as more users interact with them [22].


2.1.1 Voice Assistants

Voice assistants take NLP VUIs to the next level. Not only can they understand and
respond to conversational input, they can use information about the user to provide
relevant responses. For example, users can ask a voice assistant, "What are upcoming
Robyn concerts?" and the assistant can respond with upcoming concert dates in
the user’s city with the option to hear about concerts in other cities. However,
not all tasks completed through a voice assistant take advantage of this contextual
information, and some may require users to repeat information, introducing frustration
into the input process.

Figure 2.1: The Android Auto GUI

Figure 2.2: The CarPlay GUI

Voice assistants have recently made the leap from the phone to the car through IVI
integration interfaces. Integration interfaces allow drivers to connect their smartphone
to their car’s IVI, enabling drivers to access some of the functionality of their
phones directly on the IVI, including voice assistants. Two companies that offer IVI
integration interfaces are Google and Apple. Google’s Android Auto allows drivers
to connect their Android phone and Google Assistant to the IVI. The Android Auto
home screen can be seen in figure 2.1. Similarly, Apple CarPlay allows iOS users
to integrate their iPhone and Siri assistant into the IVI. The Apple CarPlay home
screen can be seen in figure 2.2. Both Android Auto and Apple CarPlay enable
drivers to use select apps from their phone on the IVI. Only apps which belong to
an enabled category and have been developed for in-vehicle use may be available
on the IVI. Integration interface authors, such as Apple and Google, dictate which
categories may be enabled.

Categories enabled on both Android Auto and Apple CarPlay are communication,
navigation, audio, and automaker [2, 13]. The communication category includes
apps with messaging and VoIP calling features. Navigation apps allow drivers to
locate points of interest and provide driving directions. Audio apps cover an array of
audio services which include music streaming services, podcast stations, and sports
news. Automaker apps allow drivers to get information about their car and adjust
car settings through the integration interface. If a driver’s voice assistant is enabled
on the phone, then enabled apps may be used via voice assistant. However, the
degree of voice interaction is left to the discretion of the app developer. If a developer
has chosen not to include voice interaction, then some features of the app may not
be steered by the voice assistant, instead requiring manual interaction.

2.2 Designing with Voice

While there are pure VUIs, which allow only voice input and output, many interfaces
provide multiple modes for both input and output. Screens, keyboards, and other
types of input are combined with voice to produce multimodal interfaces. There
are several approaches to using voice in interaction which can be categorized as
screen-first, voice-only, and voice-first [10, 47].

The screen-first approach prioritizes the screen and uses voice to enhance
screen functionality [47]. The screen-first approach is currently applied to most
smartphones, as their voice assistants are highly dependent on the screen. In many
cases, the user is unable to complete a voice-initiated interaction without manual
input through the screen [47]. For example, if a user requested nearby restaurant
recommendations, a screen-first system may read aloud the first recommendation
and output the remaining alternatives on the screen for the user to manually select
an option. Screen results from asking Google Assistant for nearby restaurants can
be seen in Figure 2.3.

Figure 2.3: Google Assistant displaying results for nearby restaurants on an Android
phone

A voice-only interaction uses only voice for both input and output, unlike screen-first
and voice-first. Early screenless smart speaker models such as the Amazon Echo,
Google Home, and Apple HomePod are examples of voice-only design [47]. Due to
the singular mode of input and output, using voice-only interactions to complete
simple tasks can become tedious [47].

A voice-first approach is the inverse of the screen-first approach. In a voice-first
design, a complementary display is used to visually supplement the voice interac-
tion and a user can complete an interaction through voice alone [47]. Voice-first has
been widely embraced in the latest models of voice assistants like the Amazon Echo
Show and the Google Home Hub which include touchscreens. The voice-first ap-
proach is different in that many traditional graphical user interface elements, such as
heavily-nested menus and visually dense content, are completely eliminated in favor
of contextualizing information to enhance whatever the voice is communicating [47].
Moreover, a voice-first approach assumes that the user may not always have access
to look at or touch the screen; therefore, voice carries the bulk of the interaction in
a voice-first approach [47].
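
The difference between the three approaches can be sketched with the restaurant example used above. The following is a minimal illustration under stated assumptions, not a description of how any commercial assistant works: the names `Approach` and `present`, and the exact wording of the spoken output, are hypothetical.

```python
from enum import Enum


class Approach(Enum):
    SCREEN_FIRST = "screen-first"
    VOICE_ONLY = "voice-only"
    VOICE_FIRST = "voice-first"


def present(results, approach):
    """Split a non-empty list of results into spoken and displayed output.

    'voice_completable' marks whether the user can finish the
    interaction without touching the screen.
    """
    if approach is Approach.SCREEN_FIRST:
        # Only the first result is spoken; choosing another needs manual input.
        return {"speech": f"The top result is {results[0]}.",
                "screen": list(results),
                "voice_completable": False}
    if approach is Approach.VOICE_ONLY:
        # Everything must be spoken because there is no display at all.
        return {"speech": "I found " + ", ".join(results) + ". Which one?",
                "screen": [],
                "voice_completable": True}
    # Voice-first: voice carries the interaction, the display only supplements it.
    return {"speech": f"I found {len(results)} places. The closest is "
                      f"{results[0]}. Want to hear the others?",
            "screen": list(results),
            "voice_completable": True}
```

Comparing the three return values for the same list shows the trade-off discussed in this section: screen-first leaves the choice to manual input, voice-only has to speak the whole list, and voice-first keeps the spoken part short while still allowing completion by voice.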


2.3 Autonomous Cars

As the field of voice interaction continues to develop, so does the field of autonomous
driving. According to SAE International, there are six levels which describe the
level of autonomous driving a car is capable of, as shown in Table 2.1. It is worth
noting that Levels 2 and under still require a human driver to perform part or all of
the driving task, even with the autonomous driving system (ADS) engaged [41]. In
contrast, vehicles classified as Level 3 and up are able to fully take over the primary
task of driving, under varying scenarios [41].

Table 2.1: SAE levels of driving automation [41]

Level    Name                            Autonomous Driving System role

Human driver monitors driving environment

Level 0  No Driving Automation           Does not perform any of the driving task on a
                                         sustained basis
Level 1  Driver Assistance               Performs part of the driving either in the
                                         longitudinal OR lateral motion and can be
                                         disengaged immediately upon driver request
Level 2  Partial Driving Automation      Performs part of the driving in both the
                                         longitudinal AND lateral motion and can be
                                         disengaged immediately upon driver request

Autonomous Driving System monitors driving environment while engaged

Level 3  Conditional Driving Automation  Performs all of the driving under select
                                         driver-manageable conditions and can be
                                         disengaged immediately by the driver or issue
                                         a request for the driver to intervene
Level 4  High Driving Automation         Performs all of the driving under most
                                         driver-manageable conditions and may delay
                                         driver-requested disengagement
Level 5  Full Driving Automation         Performs all of the driving under all
                                         driver-manageable conditions and may delay
                                         driver-requested disengagement

For Level 2 and under autonomous driving, ADS can provide drivers assistance with
the primary task of driving. This can in turn free up some of the driver’s attention
and cognitive load to complete secondary tasks, such as tuning the radio, replying
to a text message, or getting directions to a nearby point of interest. However,
in Level 3 and up autonomous driving, handing off the primary task of driving
to the ADS from the human driver may introduce a new interaction paradigm in
the vehicle. In scenarios where the human driver is no longer responsible for the
driving, the primary task may shift dramatically from driving to other tasks, such
as entertainment or work.
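
The split between Levels 0 to 2 and Levels 3 to 5 described above can be sketched in a few lines of Python. This is only an illustrative encoding of Table 2.1; the names SAELevel, human_monitors_environment, and primary_task_owner are assumptions made for this sketch and do not come from the SAE standard.

```python
from enum import IntEnum


class SAELevel(IntEnum):
    """The six levels of driving automation listed in Table 2.1."""
    NO_AUTOMATION = 0
    DRIVER_ASSISTANCE = 1
    PARTIAL_AUTOMATION = 2
    CONDITIONAL_AUTOMATION = 3
    HIGH_AUTOMATION = 4
    FULL_AUTOMATION = 5


def human_monitors_environment(level: SAELevel) -> bool:
    """At Levels 0-2 the human driver monitors the driving environment."""
    return level <= SAELevel.PARTIAL_AUTOMATION


def primary_task_owner(level: SAELevel) -> str:
    """Who carries the primary task of driving (hypothetical helper)."""
    if human_monitors_environment(level):
        return "human driver"
    return "ADS (while engaged)"


if __name__ == "__main__":
    for level in SAELevel:
        print(level.value, level.name, "->", primary_task_owner(level))
```

A voice assistant prototype could use such a check to decide whether an interaction must be treated as a secondary task competing with driving or whether the driver may give it full attention.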

2.4 Distracted Driving

The National Highway Traffic Safety Administration (NHTSA) is the U.S. governmental
agency responsible for setting and enforcing safety standards in vehicles [30].
The NHTSA reported that 3,450 deaths in the United States in 2016 were due to
distracted driving [29]. The year prior, a staggering 391,000 people
suffered injuries from distracted driving related incidents [29]. With statistics like
these, distracted driving has become a key traffic safety issue.

According to the NHTSA, distracted driving refers to the diversion of drivers’ attention
from the primary task of driving to other activities or secondary tasks [30]. Electronic
devices in particular are an area of concern for the NHTSA as more and more
technology is incorporated into modern vehicles. Electronic devices can influence
drivers by causing visual distraction, manual distraction, and cognitive distraction
[30].

In an effort to combat distracted driving from electronic devices, the NHTSA has
thus far issued two phases of guidelines for designing in-vehicle electronic devices.
Phase One of the design guidelines concerns the design of original equipment (OE),
such as the in-vehicle infotainment system that already comes installed on a vehicle
[30]. Phase Two extends the guidelines from the first phase to include portable
and aftermarket devices, which include smartphones with a car mode [31]. Both
guidelines use eye glance metrics as acceptance criteria, where eye glances away
from the road for more than 2.0 seconds are correlated with an increased crash risk
[30, 31]. While both Phase One and Phase Two acknowledge voice interaction as
an alternative to traditional HMIs, both guidelines explicitly exclude voice
interaction. The NHTSA has announced plans for Phase Three of the guidelines,
which would provide recommendations specifically for voice interaction; however,
there is currently no set date for when these guidelines will be published, leaving
safe voice interaction in vehicles largely undefined.
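
As a minimal sketch of how the 2.0 second glance threshold mentioned above could be applied to recorded data, the following Python snippet flags off-road glances that exceed the limit. Only the 2.0 second value comes from the text; the function names and the sample durations are assumptions for illustration, and the full NHTSA acceptance criteria are not reproduced here.

```python
GLANCE_LIMIT_S = 2.0  # single-glance threshold cited in the text [30, 31]


def long_glances(durations_s, limit_s=GLANCE_LIMIT_S):
    """Return the off-road glance durations that exceed the limit."""
    return [d for d in durations_s if d > limit_s]


def share_within_limit(durations_s, limit_s=GLANCE_LIMIT_S):
    """Fraction of off-road glances at or below the limit (1.0 if no glances)."""
    if not durations_s:
        return 1.0
    within = sum(1 for d in durations_s if d <= limit_s)
    return within / len(durations_s)


if __name__ == "__main__":
    # Hypothetical off-road glance durations (seconds) for one task
    glances = [0.8, 2.4, 1.9, 2.0]
    print("Glances over limit:", long_glances(glances))
    print("Share within limit:", share_within_limit(glances))
```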

While the jurisdiction of the NHTSA is limited to the United States of America,
its safety recommendations extend beyond those borders. In following the guidelines
set by the NHTSA for vehicles in the American market, car manufacturers in practice
also apply these guidelines to vehicles in markets outside of the United States.
Alternate guidelines for designing in-vehicle interfaces include those published by the
Japan Automobile Manufacturers Association (JAMA), the Alliance of Automobile
Manufacturers (AAM), and the EU [20, 26, 9]. However, the NHTSA guidelines are the most
recent guidelines and likely the most relevant when considering voice interaction as
an emerging technology.

2.5 Guidelines for In-vehicle and Voice Interfaces

At present, few guidelines consider the holistic interface of a voice-assistant inte-
grated IVI. However, the existing guidelines for both in-vehicle and voice interfaces
outline important considerations for each respective interface that should be taken
into account.

2.5.1 NHTSA Interface Guidelines

Phase One of the NHTSA interface guidelines is applicable to original IVIs [30].
Recommendations in the Phase One guidelines include where to place the IVI, what
tasks should not be allowed on the IVI, and IVI response time. The guidelines also
describe a number of best practices for interacting with an IVI manually. Some
notable interaction guidelines include single-handed operation, interruptibility, and
disablement [30]. Drivers should be able to operate the IVI with a single hand
while driving, and the IVI should not require the driver to complete an uninterrupted
sequence of tasks [30]. Drivers should be able to stop a task mid-way and then
resume the task if not completed [30]. Additionally, IVIs should have the ability to
disable the display of any non-safety related information through methods including
dimming, blanking, or changing the state of the display [30].

Phase Two of the NHTSA guidelines expands upon Phase One to
include the interfaces of portable and aftermarket devices [31]. Notable additions
from the Phase Two guidelines include pairing devices, driver mode, and access to
emergency services and alerts [31]. For devices that can be paired with the original
IVI, the pairing and disconnection should be easy to complete. When paired and
using the IVI display, guidelines from Phase One should also be followed [31]. For
unpaired devices, there must be a driver mode which conforms to the Phase One
recommendations [31].

As the second set of guidelines is an expansion, portable and aftermarket devices
described in Phase Two of the guidelines must also follow the guidelines defined
in Phase One. In both the paired and unpaired
scenarios, emergency services and alerts must be easily accessible [31]. However, the
guidelines do not state what additional notifications should also be accessible, such
as communication notifications.

2.5.2 Android Auto

Android Auto is the integration interface made available by Google for compatible
Android phones. Only apps which fall into the navigation, communication, media,
or automaker categories can be enabled for use through Android Auto [13]. The
Android Auto design guidelines are primarily concerned with the appearance and
structure of visual content on the IVI. Android Auto uses a global UI, which means
the visual interface of each app uses a template provided by Google [13]. By
using a template approach, drivers using Android Auto do not need to learn
app-specific UIs when switching between two apps in the same category. The Android
Auto guidelines make almost no mention of designing for voice interaction, save for
composing or replying to a message [13].

The Android Auto guidelines prescribe recommendations for user input, menu or-
ganization, and notification display. The pace of input into the IVI should be deter-
mined by the user [13]. This recommendation aligns with the NHTSA guidelines for
interruptibility. The Android Auto guidelines also suggest items in the drawer menu
be context-specific [13]. For example, rather than displaying broad categories such
as "All Songs" and "All Artists", the menu items should be more specific, such as "Top
Hits" or "Favorite Artists". The guidelines also state that notifications may be used
if they are appropriate to driving or important enough to interrupt the driver [13].
However, Android Auto provides little guidance on what is considered "important
enough" and leaves it up to the discretion of the designer.

2.5.3 Apple CarPlay

Like Android Auto, the Apple CarPlay guidelines use a global set of UI elements
and a template system [2]. Voice integration is briefly described for automaker and
communication apps, though Apple does have a separate guideline for custom Siri
voice commands [2]. When CarPlay is active, interactions on the iPhone should
be eliminated and CarPlay interactions should never require input from the iPhone
[2]. The Apple guidelines also provide a number of test conditions for designing a
CarPlay enabled app [2]. For example, apps should be tested in an actual car, not
a simulator alone, and in varying network conditions [2].

Generally, the Apple CarPlay guidelines provide more guidance to designers re-
garding the architecture of apps including badging, error handling, and navigation
structure. The Apple CarPlay guidelines also provide detailed recommendations for
content writing, organization, and notifications. Written content in CarPlay should
be succinct and avoid accusatory or judgmental tones [2]. Content and navigation
should require as few inputs as possible, either through flat or hierarchical naviga-
tion [2]. Moreover, there should only be one path for manual input to a specific
view [2]. Alerts should be minimized and used only when there is an error so users will
take them seriously [2].

2.5.4 Google Assistant

Google’s design framework for voice interaction is called Conversation Design. It
is an extensive framework with a lot of detailed information and examples. Google
highlights the framework as being multimodal and consisting of many different dis-
ciplines of design such as voice, audio and visual design. Google argues that all of
these disciplines are required to design real conversations as, according to them, real
conversation is a multimodal activity.

The Conversation Design framework is built upon Grice’s Cooperative Principle.
This principle states that conversation is shaped by the social context and that this
shaping of the conversation relies on a type of subconscious cooperation between the
conversing parties. Grice’s Cooperative Principle is covered in depth in Section 3.7 of
this report.

The design framework provides extensive guidelines regarding the context
of conversation, variations of phrases and turn-taking during dialogues. A shorter
list of visual components to be used together with voice assistants is also provided.
Information regarding how and when graphical components are to be used in
combination with conversation is, however, very limited, and the few guidelines that
exist on this topic are very general.

2.5.5 Siri

The Siri voice guidelines describe how to integrate the voice assistant in a variety of
contexts for a seamless voice-driven experience [3]. Moreover, the guidelines describe
when Siri would enhance an interaction and how to create Siri responses. The Siri
framework supports shortcuts which can perform useful or frequent actions without
much navigation [3]. Shortcuts should be short and concise, but also not context-
specific [3]. An example shortcut could be "Order clam chowder". Designers can
make shortcuts more relevant and accurate using custom vocabulary or providing
examples on the screen [3]. Like Google Assistant, Apple recommends that Siri
responses are conversational. Apple additionally recommends that actions should
be voice-driven with as little manual input as possible, a voice-first approach. Verbal
responses from Siri should be accurate and relevant to the user’s request [3].

2.6 Related Research

Voice interaction in vehicles has been well-researched in terms of distracted driving
and usability. However, as voice interfaces continue to evolve, so do research
opportunities in the field. Previous research on automotive VUIs has generally
focused on in-vehicle VUIs, in other words, VUIs that are built into the car by the OEM,
instead of portable alternatives such as modern-day voice assistants.

In their 2013 review, Lo and Green surveyed key research on in-vehicle VUIs [25].
The VUIs covered by Lo and Green all used NLP, but they did not utilize cloud
computing as voice assistants do [25]. Core functionality across the systems
surveyed included communication, media, and navigation, not unlike the enabled app
categories on both Android Auto and Apple CarPlay [25]. However, some of these
systems had extended functionality, such as climate control via voice command [25].

More recent studies compared different VUIs against each other to identify the effect
of different voice-driven multimodal interactions on distracted driving. Mehler et
al. compared the Chevrolet MyLink and Volvo Sensus against each other, where
the former allows for ‘one-shot’ voice input while the latter requires input through
a series of menus and sub-menus [28]. For most tasks, ‘one-shot’ input performed
better than guided, menu-based input given no recognition errors [28]. However, if
there were recognition errors, the ‘one-shot’ input, similar to that of current voice
assistants, increased driver workload and caused user frustration [28]. Reimer et
al. further expanded upon the work by comparing a Samsung S-Voice assistant
against the two in-vehicle systems evaluated by Mehler et al. [28, 39]. Reimer et al.
found that the smartphone assistant actually performed worse than the embedded
in-vehicle systems [39]. However, they proposed that coupling the smartphone
with the embedded IVI to create one holistic experience may reduce workload and
visual demand [39].

One study that does examine the holistic experience of a voice-assistant integrated
IVI on distracted driving was conducted by Strayer et al. for the AAA Foundation
for Traffic Safety [42]. Motivated by the lack of Phase Three guidelines from the
NHTSA, this study investigated how Apple’s Siri affects distracted driving [42].
The study found that the use of a voice assistant to carry out a secondary task
significantly increased the crash risk; however, the study has yet to be corroborated
and does not provide suggestions to address the issue of increased risk [42].

Beyond voice interaction, distracted driving has been studied in many capacities. A
2009 review by Bach et al. surveyed 100 papers related to understanding attention
within automobiles [4]. Despite the extensive study of attention and cognitive
load while performing secondary tasks, the review makes it clear that there is no
single method for assessing attention and cognitive load [4]. Previous studies have
used primary task performance, secondary task performance, eye glance behavior,
physiological measures, and subjective assessments to measure attention and cognitive
load [4]. The variety of methods and the lack of a single standard illustrate the
difficulty in capturing and measuring what goes on in the mind while performing
multiple tasks.


3
Theory

Voice interaction, especially for in-vehicle use, sits at the intersection of many fields
including design research, attention, cognitive load, and linguistics. This chapter covers
the theory and domain-specific knowledge from these fields that are related to this
project.

3.1 Research Approach

The research approach of this project relies on human-centered design (HCD), where
user involvement and testing with users is central to the design and development of
a product. The idea that the solutions to a problem are held within the very people
who face this problem is a core idea of HCD [19]. Social research principles also
support the frequent user involvement in this research project. One prevalent idea in
the social research approach is that if enough people agree on a subjective opinion, it
can become an objective fact [46]. This can be said of the design field, where many
designers and researchers consider involving users as part of the design process or
design research to be standard, thus objectively validating HCD as an approach.

HCD largely focuses on understanding the users and evaluating with and for users
throughout the process [12]. Another characteristic of an HCD approach is applying
a wide range of disciplinary skills and perspectives [12]. This project especially
applies theory from psychology and cognition to be able to properly research the
user’s attention and cognitive load. Applying a varied set of theories and concepts
from different fields is, according to Gaver, a way to both inspire and articulate new
and already existing designs [11].

3.2 Wickens’ Attention Model

The attention of a human being is a limited resource. When it comes to the task
of driving and all of the secondary tasks that follow in a modern car, managing
attention and distributing it correctly becomes very important. There are various
theories explaining the complexities of human attention resources. One which has
proven to be especially relevant to mental workload in relation to multitasking is
Wickens’ Multiple Resource Theory [48]. According to this theory, the attention of
humans can be divided into different resource pools. The different resource pools
represent a human’s ability to process different types of stimuli. The internal
processes are divided into perception, cognition and response. Figure 3.1 shows the
four-dimensional resource model.

Figure 3.1: Wickens’ Multiple Resource Model [48]

According to Wickens, humans are able to perceive four different types of input:
auditory-spatial, auditory-verbal, visual-spatial and visual-verbal [48]. Multiple
Resource Theory posits that multiple simultaneous inputs are better perceived if they
are of different types. When internal mental processes move from the perception of
input to the cognition of it, humans are capable of simultaneously processing verbal
and spatial input. In the final internal process, humans are capable of deciding a
response to manual-spatial and vocal-verbal input at the same time. However, the
ability to simultaneously process input is still affected by the weight and complexity
of the individual inputs. This means that very complex spatial input will affect a
person’s ability to process other input at the same time, even if the additional input
is of another modality.

Multiple Resource Theory helps to reinforce the findings of previous research which
concludes that voice interfaces are a safer input method in vehicles [28, 39, 25].
According to the theory, verbal information from a voice interface should interfere
less with the visual information from looking at the road, as the two inputs draw on
different resource pools in the driver’s mind.
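
The idea that inputs of different types compete less can be illustrated with a small Python sketch that lists the perceptual dimensions two inputs share. This is a deliberately simplified, qualitative illustration and not Wickens’ computational model; the class PerceptualInput, the function shared_dimensions, and the example inputs are assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PerceptualInput:
    """An input described along two perceptual dimensions of the theory."""
    name: str
    modality: str  # "visual" or "auditory"
    code: str      # "spatial" or "verbal"


def shared_dimensions(a: PerceptualInput, b: PerceptualInput) -> list:
    """List the dimensions on which two inputs draw on the same resource pool."""
    shared = []
    if a.modality == b.modality:
        shared.append("modality")
    if a.code == b.code:
        shared.append("code")
    return shared


if __name__ == "__main__":
    road = PerceptualInput("watching the road", "visual", "spatial")
    voice = PerceptualInput("listening to a voice assistant", "auditory", "verbal")
    menu = PerceptualInput("reading a touchscreen menu", "visual", "verbal")

    # No shared dimension: the theory predicts less interference
    print(shared_dimensions(road, voice))
    # Shared visual modality: the theory predicts more interference
    print(shared_dimensions(road, menu))
```

Note that the sketch ignores the weight and complexity of each input, which, as described above, still limits simultaneous processing even across different modalities.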

3.3 Intensive and Selective Attention

Another theory for explaining human attention is Kahneman’s work on effort and
attention [21]. According to Kahneman, the two most important factors affecting
attention are intensity and selectivity [21].

Intensity is directly connected with the effort one applies to their current focus of
attention [21]. A person may direct greater effort into a specific focus of attention
when motivated by arousal or personal choice [21].

Selectivity describes how a person decides to distribute their effort toward different
sources of attention [21]. Ultimately, the total amount of effort available at a given
moment is limited [21].

Problems occur when different sources of attention and their demand of effort
interfere with each other. This explains the difficulty behind dividing attention, such as
in multitasking. The idea of interference in the distribution of attention is interesting,
as it provides a contextual explanation of the ideas presented by Wickens’
Attention Model, which were summarized in Section 3.2, and helps explain the difficult
task of dividing attention.

3.4 Cognitive Load

There are many different definitions of cognitive load, sometimes also referred to as
cognitive workload. Waard decomposes cognitive workload into two parts: demand
and load [45]. Demand is the specific external demand a task places upon a user.
Load is the individual effect of the task demand placed upon a user. Task demand is
highly dependent on the complexity of the task: increased task complexity increases
the demand of the task. Perceived load is more complex and depends on a variety of
factors including skill, experience, and current mood of the person performing the
task. When examining cognitive load, both task demand and task load should be
considered, as the two are closely related. In a driving situation, the main task of
driving places a certain demand on the driver. Depending on the driver’s skill level
and experience, the perceived load will vary. When adding secondary tasks, like
making phone calls and playing music, the total load of the driver further increases.

When analyzing cognitive load, there are several aspects to consider. Cognitive
load essentially is a measure of how many mental processing resources are available.
The upper limit of resources is referred to as the capacity [45]. In a practical scenario
where cognitive load is measured, a researcher tries to measure how many resources
are available and how close the test participant is to their capacity limit. In a driving
scenario, the driver always needs to have enough resources to handle the primary
task of driving.

In section 3.2 the concept of attention resources was introduced. Attention resources
are closely related to mental processing resources. Perceiving input, the first step
of the previously mentioned Wickens’ Attention Model [48], is a prerequisite to pro-
cessing input through the consumption of mental resources. The stages of cognition
and response in Wickens’ model correspond with the mental processing concepts
that are central to discussing cognitive load.

3.5 Eye Movement

In order to assess visual distraction, it is important to understand how to analyze
a person’s eye movements, which consist of four basic movements: saccades, smooth
pursuit movements, vergence movements, and vestibulo-ocular movements [37].

Saccades are the most basic type of movement. They are quick movements that
occur when a person changes their eyes’ fixation from one point to another [37].
Saccades may be short or long depending on the situation. When driving, the
moment between saccades can be interesting to analyze as the user’s fixation points
are likely to switch between the road and the various interfaces within the vehicle.

Smooth pursuit movement occurs when a person fixates their view on a moving
object. Smooth pursuit movement is difficult to perform without a moving object.
Attempts to perform this eye movement by the untrained may actually instead be
a series of short saccades [37].

Vergence movements occur when a person fixates on a point that moves either
closer to or further away from the person [37]. Vergence movements are different from
the two mentioned above, since the eyes move in different directions from each
other during this movement, whereas they move in the same direction during saccades
and smooth pursuit movement [37].

Vestibulo-ocular movements are made in order to stabilize the eyes during move-
ments from the outside world such as fixating on a point while the head is moving
in some direction [37].

When working with eye tracking, several types of data can be analyzed. One type of
data is glances. A glance is a fixation on a specific point in the world between two
saccades. By this definition, glances have both a duration and a direction. With
respect to this project, glances are a highly relevant type of data as they are used
in part by the NHTSA to define safe task interactions [30]. Glance directions can
be divided into glance areas of interest in order to more easily measure glances on
specific areas of interest within the car.
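
As a rough sketch of how glance durations per area of interest could be derived from eye tracking data, the following Python snippet groups consecutive gaze samples that fall on the same area of interest into glances. It assumes a fixed sampling rate and pre-labelled samples, and it treats every uninterrupted run on one area as a single glance without detecting saccades; the function names and labels are hypothetical.

```python
from itertools import groupby


def glances_from_samples(aoi_samples, sample_rate_hz):
    """Group consecutive samples on the same area of interest (AOI) into
    (aoi, duration_in_seconds) glances."""
    glances = []
    for aoi, run in groupby(aoi_samples):
        sample_count = sum(1 for _ in run)
        glances.append((aoi, sample_count / sample_rate_hz))
    return glances


def total_time_per_aoi(glances):
    """Sum glance durations for each area of interest."""
    totals = {}
    for aoi, duration in glances:
        totals[aoi] = totals.get(aoi, 0.0) + duration
    return totals


if __name__ == "__main__":
    # Hypothetical gaze labels sampled at 2 Hz
    samples = ["road"] * 4 + ["ivi"] * 2 + ["road"] * 2
    glances = glances_from_samples(samples, sample_rate_hz=2)
    print(glances)
    print(total_time_per_aoi(glances))
```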

3.6 Elements of Voice Interfaces

To understand and discuss VUIs, it is important to know the basic elements of a
voice interface. These elements are: utterances, responses, prompts, and intents.
Together, these elements create a dialog, a linguistic exchange between the user and
the VUI [16].

An utterance is a natural unit of speech which can range from a single word to a
small cluster of sentences [16]. With respect to VAs, utterances are usually inputs
from the user.

A response is the second utterance in a summons/response pair [16]. If a summons
is a request from a VA user, such as "What’s the weather today?" then a response
manifests as information related to the day’s weather.

A prompt is a system utterance that helps guide user input [16]. Prompts are most
often in the form of questions which can be explicit ("Which flowers would you
like to order, roses or daisies?"), implicit ("Which type of music would you like to
listen to?"), or open-ended ("What can I do for you?") [16]. Inferential prompts
are typically statements that convey to the user the capabilities of the VUI ("I can
answer questions about train arrivals, departures, and on-board amenities.") [16].

An intent is a representation of an action or a feature that fulfills a user’s spoken
request. Intents may include variable information to complete a user’s request. In
the previous example for responses, the intent is to get weather information, where
"today" is a variable that enables the VUI to respond with relevant information.

Utilizing these elements, and mimicking a VUI’s way of processing these, will be
necessary when trying and testing Wizard of Oz style prototypes.
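
The relationship between utterances, intents, responses, and prompts can be sketched with a toy example in Python. This is only an illustration under simplifying assumptions: real voice assistants use NLP rather than a single regular expression, and the names Intent, parse_utterance, and respond as well as the intent name get_weather are invented for this sketch.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Intent:
    """An action that fulfills a request, plus its variable information."""
    name: str
    slots: dict = field(default_factory=dict)


# Toy pattern for one utterance shape; "day" is the variable part
WEATHER_PATTERN = re.compile(
    r"what['’]?s the weather (?P<day>today|tomorrow)\??", re.IGNORECASE
)


def parse_utterance(utterance: str):
    """Map a user utterance to an Intent, or None if it is not understood."""
    match = WEATHER_PATTERN.fullmatch(utterance.strip())
    if match:
        return Intent("get_weather", {"day": match.group("day").lower()})
    return None


def respond(intent) -> str:
    """Produce a response for a known intent, or an open-ended prompt."""
    if intent is None:
        return "What can I do for you?"
    return f"Here is the weather for {intent.slots['day']}."


if __name__ == "__main__":
    print(respond(parse_utterance("What's the weather today?")))
    print(respond(parse_utterance("Play some music")))
```

In a Wizard of Oz prototype, a human operator would play the role of parse_utterance and respond, choosing the intent and reply by hand.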

3.7 The Cooperative Principle

NLP VUIs aim to function through conversation between the user and the system.
To design computers to converse in a natural way, VUI designers must understand
the underlying principles of conversation. The semantics of conversation have been
carefully studied by H. Paul Grice, who has defined the underlying mechanics of
conversation through a set of principles [15]. Together, these principles are known
as The Cooperative Principle, which is made up of four sets of subprinciples or
maxims [15]. The maxims describe the subconscious cooperation that occurs as a
person formulates sentences in a conversation [15]. Grice’s Maxims are as follows
[15]:

Quantity

1. Make your contribution as informative as required (for the current purposes
of the exchange).

2. Do not make your contribution more informative than is required.

Quality

1. Try to make your contribution one that is true.

(a) Do not say what you believe to be false.

(b) Do not say that for which you lack adequate evidence.

Relation

1. Be relevant.

Manner

1. Be perspicuous.

(a) Avoid obscurity of expression.

(b) Avoid ambiguity.

(c) Be brief (avoid unnecessary prolixity).

(d) Be orderly.

These maxims can be used to formulate the output of a VUI. They can also be
applied when designing the VUI to anticipate different user inputs and how the sys-
tem should respond to them. This applies for designing the dialog of any prototypes
developed as a part of this project.


4
Methods

This chapter covers all methodology relevant to the project. Usage details regarding
the methods, suitable contexts of use and alternative methods are discussed. The
methods are varied ranging from purely evaluative to creatively stimulating and can
be utilized at different points throughout the project.

4.1 Wicked Problems and Iterative Design

Many of the challenges and problems designers aim to solve are known as wicked
problems. Rittel and Webber were the first to define wicked problems, which are
problems that are unique, have no definitive formulation, have no stopping rule and
whose solutions are not true-or-false but good-or-bad [40]. By comparison, there
are tame problems which have a definite formulation and solution, such as math
problems which have stopping rules to indicate when a solution has been reached
and equations by which the solution can be verified as true or false. Solutions to
wicked problems are rated on a scale of good or bad, where some solutions are better
than others and some may be considered a good enough solution to the problem.
Thus, as many designers tackle wicked problems, they may use an iterative design
process to explore several solutions to find a better or good enough solution.

There are four basic activities in a design process: establishing requirements, design-
ing alternatives, prototyping, and evaluating [36]. Iterative design is the process by
which a design is refined by user feedback through the repetition of these four design
activities. The iterative design process has been visualized as a design funnel, where
at the start of the process, designers begin at the wide end of the funnel and explore
a broad number of potential design solutions [7]. As designers progress through the
design process, they move towards the narrow end of the design funnel, reducing the
number of possible design solutions and ultimately arriving upon a design solution
[7].

In an iterative design process, each iteration is a step toward narrowing the design
funnel. However, each iteration in itself is not narrowing, or reducing [7]. In fact,
each iteration is a combination of divergent and convergent thinking where the
divergence comes from the generation of new ideas and improvements to a design
and convergence is the reduction of those solutions into an iteration or prototype of
the design [7]. With respect to wicked problems, each iteration adds knowledge and
is an attempt to define and solve the problem.

Figure 4.1: Design funnel as described by Bill Buxton where dashed lines indicate
divergence and solid lines indicate convergence in the design process [7]

4.2 Literature Reviews

A literature review is conducted by researching and reviewing research literature
relevant to the field of study [27]. The purpose of a literature review is to gather
knowledge from previous research or findings to guide new research within a related
field [27]. A literature review can vary in its result, from establishing a theoreti-
cal framework for discussing previous and future research to practical information,
such as guidelines for designing for a specific context. Literature reviews enable
researchers and designers to make connections and cross-references between several
literature sources in order to understand the larger context behind their own work
as well as how their own research can provide new knowledge.

4.3 Summative and Formative Evaluation

Evaluative testing can be divided into two types: formative and summative [33]. A
summative evaluation is focused on evaluating the quality of a system or a product
[33]. It is typically suitable at the end of a design process, evaluating a finished
system, but also when two alternatives are available or when market competitors
are analyzed. Summative evaluations tend to be focused on measuring quantitative
data [33]. A formative evaluation is focused on providing input to improve a system
or a product [33]. It is typically done in an iterative design process, driving the
design forward and motivating design choices and improvements [33]. Formative
evaluations are more focused on providing qualitative input [33].

4.4 Field and Lab Testing

There are several different possible approaches to testing the voice assistants in cars.
For this project, the considered options are: in a car simulator, in a real car on a test
track or in a real car on real roads. There are specific pros and cons of each method
but the contextual aspects of sitting in a real car are weighted as being especially
important. Simulations have the great benefit of being a completely controlled
environment where the scenario can be completely consistent between tests. A large
disadvantage of using simulation is that the participants never feel the sense of real
danger as a consequence of their driving. This might lead to the driver adopting a
more reckless driving style than usual, affecting the overall outcome of the test
[8].

Testing in a real car while driving on actual roads with traffic has the benefit
of providing real, contextual information and performance shaping factors but at
the same time, the environment is completely uncontrollable. Traffic situations,
weather, and red and green lights are all factors that would be completely random.
Knowing exactly how these factors affect the results is very difficult.

Conducting tests in a real car on a closed-off, controlled test circuit allows for some of
the benefits of both previously mentioned methods. The environment can be better
controlled. Real traffic situations can be mimicked and, since the participants are
driving real cars, the sense of consequence and danger is there, forcing the driver to
always pay close attention to their driving. Weather is still an uncontrollable factor.

4.5 A/B Testing

A/B testing means testing two different versions of a design so that results can be
compared and it can be determined which one performs better [27]. The A/B
testing method is not qualitative: the two versions, A and B, are only measured by
how well they fulfill a certain quantitative criterion. An example could be two versions
of a voice assistant where the time to complete a specific task is measured. The
A/B test would in this example only result in knowledge about which
one is faster, not why it is faster [27]. To cope with this lack of qualitative
data, it is recommended that A/B testing is combined with other, qualitative methods.
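
As a minimal sketch of the voice assistant example, the snippet below compares two versions by their mean task completion time. The sample times and the function name compare_ab are illustrative assumptions and are not data or tooling from this project.

```python
from statistics import mean

def compare_ab(times_a, times_b):
    """Return the label of the version with the lower mean task time, plus both means."""
    mean_a, mean_b = mean(times_a), mean(times_b)
    if mean_a == mean_b:
        faster = "tie"
    else:
        faster = "A" if mean_a < mean_b else "B"
    return faster, mean_a, mean_b

# Hypothetical task completion times in seconds (not measured in this project)
times_a = [12.0, 15.0, 11.0, 14.0]
times_b = [10.0, 9.0, 12.0, 11.0]

faster, mean_a, mean_b = compare_ab(times_a, times_b)
print(f"A: {mean_a:.1f} s, B: {mean_b:.1f} s, faster: {faster}")
```

The output states only which version is faster. Explaining why it is faster still requires a complementary qualitative method, such as an interview.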

4.6 Interviews

Conducting interviews is a method for design research that allows direct interaction
with users and allows researchers to explore the user’s personal views,
experiences and perceptions about a subject [27]. Interviews are best done in person
so that the researcher may collect information in the form of body language and
facial expressions as well as what is actually said by the user [27].

Interviews can be structured or unstructured. In a structured interview, all questions
are planned in advance; in an unstructured interview, the questions are made up as the
interview progresses [27]. There are also combinations where topics and some base questions are
formed in advance but the interviewer is allowed to ask new, unplanned questions
as needed. This is sometimes called a semi-structured interview.

Interviewing is a very flexible method that allows customization and tweaking for
specific uses. Interviews can be done in groups or individually and can be focused
on attaining information from specific roles or user groups [27].

4.7 Cognitive Workload Measuring

This section covers details and differences of four different methods that have been
developed for the purpose of measuring a subject’s cognitive workload.

4.7.1 NASA-Task Load Index

The NASA-Task Load Index (NASA-TLX) is a rating-based measurement method
for assessing the subjective experience of workload during activities [17]. The
method divides the workload into several specific sources, which allows
the sources contributing most to the workload of a specific task to be identified [17].

The method has two steps: first a set of rating scales, then pairwise comparisons.
The first step consists of rating all possible sources of workload on a scale divided
into 20 intervals, representing 0 to 100 in steps of 5. The different sources of workload
can be seen in Table 4.1.


Table 4.1: The NASA-TLX measurement factors and their descriptions [17]

Title | Endpoints | Description
Mental Demand | Low/High | How much mental and perceptual activity was required (e.g., thinking, deciding, calculating, remembering, looking, searching, etc.)? Was the task easy or demanding, simple or complex, exacting or forgiving?
Physical Demand | Low/High | How much physical activity was required (e.g., pushing, pulling, turning, controlling, activating, etc.)? Was the task easy or demanding, slow or brisk, slack or strenuous, restful or laborious?
Temporal Demand | Low/High | How much time pressure did you feel due to the rate or pace at which the tasks or task elements occurred? Was the pace slow and leisurely or rapid and frantic?
Performance | Low/High | How successful do you think you were in accomplishing the goals of the task set by the experimenter (or yourself)? How satisfied were you with your performance in accomplishing these goals?
Effort | Low/High | How hard did you have to work (mentally and physically) to accomplish your level of performance?
Frustration Level | Low/High | How insecure, discouraged, irritated, stressed and annoyed versus secure, gratified, content, relaxed and complacent did you feel during the task?

The second part of the NASA-TLX is a weighting process that weights
the ratings according to how much each source influenced the task. All possible pairwise
combinations of the sources of workload are presented, and for each pair the user chooses
the source that influenced the task more. This leads to a
weighted rating for each of the sources of workload, and the total weighted task
load score is calculated as the sum of the weighted ratings divided by the number of
pairwise comparisons, which is 15 for six sources.
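
The weighting step can be written compactly. As a sketch, assume that r_i is the rating given to workload source i (0 to 100) and that w_i is the number of pairwise comparisons in which source i was chosen. These symbols are introduced here for illustration only. With six sources there are 15 pairwise comparisons, so the weights sum to 15:

```latex
\[
  W = \frac{1}{15} \sum_{i=1}^{6} w_i \, r_i,
  \qquad \sum_{i=1}^{6} w_i = \binom{6}{2} = 15
\]
```

Because the weights sum to 15, W is a weighted average of the six ratings and stays within the 0 to 100 range of the rating scale.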


4.7.2 The Driving Activity Load Index

The Driving Activity Load Index (DALI) is a subjective method for evaluating
the cognitive workload of car drivers [35]. The method is largely based on the
NASA-TLX but is revised to more carefully evaluate aspects that are specifically relevant
to driving, leaving out aspects such as physical demand [35]. A complete list
of the measurement factors of the DALI method and their descriptions
can be seen in Table 4.2.

Table 4.2: The DALI measurement factors and their descriptions [35]

Title | Endpoints | Description
Effort of Attention | Low/High | To evaluate the attention required by the activity: to think about, to decide, to choose, to look for and so on.
Visual Demand | Low/High | To evaluate the visual demand necessary for the activity.
Auditory Demand | Low/High | To evaluate the auditory demand necessary for the activity.
Temporal Demand | Low/High | To evaluate the specific constraint owing to timing demand when running the activity.
Interference | Low/High | To evaluate the possible disturbance when running the driving activity simultaneously with any other supplementary task such as phoning, using systems or radio and so on.
Situational Stress | Low/High | To evaluate the level of constraints/stress while conducting the activity such as fatigue, insecure feeling, irritation, discouragement and so on.

The method is used after a user has performed a task or a set of tasks related to
driving. The user rates each of the measurement factors on how big an impact
it had on the task, on a scale whose two endpoints are very low and very high. The
measurement factors are then weighted in relation to each other: the user is shown
two factors at a time and chooses the one that had the most impact.
This is repeated until all possible pairs of factors have been shown. The
total number of times a certain factor has been chosen as the most impactful is its
weight. The original rating that the user filled out is multiplied by
its corresponding weight to produce that factor’s adjusted score. The sum of all
adjusted scores divided by 15, the number of pairwise comparisons among six factors,
represents the weighted rating for the whole task.
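
The weighting procedure described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the tooling used in this project: the function name dali_weighted_rating, the 0 to 5 numeric rating range, and the example ratings and choices are all hypothetical.

```python
from itertools import combinations

DALI_FACTORS = [
    "Effort of Attention",
    "Visual Demand",
    "Auditory Demand",
    "Temporal Demand",
    "Interference",
    "Situational Stress",
]

def dali_weighted_rating(ratings, pair_choices):
    """Compute the adjusted score per factor and the overall weighted rating.

    ratings: dict mapping factor name -> rating given by the user
    pair_choices: dict mapping each unordered pair (frozenset) -> chosen factor
    """
    pairs = [frozenset(p) for p in combinations(DALI_FACTORS, 2)]
    if set(pair_choices) != set(pairs):
        raise ValueError("every pair of factors must be compared exactly once")
    # Weight = number of times a factor was chosen as the more impactful one
    weights = {factor: 0 for factor in DALI_FACTORS}
    for pair, chosen in pair_choices.items():
        if chosen not in pair:
            raise ValueError("the chosen factor must belong to the pair")
        weights[chosen] += 1
    adjusted = {f: ratings[f] * weights[f] for f in DALI_FACTORS}
    overall = sum(adjusted.values()) / len(pairs)  # 15 pairs for six factors
    return adjusted, overall

# Hypothetical ratings on an assumed 0-5 scale
ratings = dict(zip(DALI_FACTORS, [4, 3, 2, 1, 5, 0]))
# Hypothetical choices: in each pair, the higher-rated factor is picked
pair_choices = {
    frozenset(pair): max(pair, key=ratings.get)
    for pair in combinations(DALI_FACTORS, 2)
}
adjusted, overall = dali_weighted_rating(ratings, pair_choices)
print(adjusted["Interference"], round(overall, 2))
```

Dividing by the number of pairs keeps the overall rating within the range of the original rating scale, which makes conditions comparable across participants.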


4.7.3 Subjective Workload Assessment Technique

The Subjective Workload Assessment Technique (SWAT) is a scaling procedure
that allows test participants to put a number on their subjective experience of mental
workload during a task [38]. It was originally developed for the U.S. Air Force to
assess pilots’ mental workload [38]. The SWAT method measures
workload in three dimensions: Time Load, Mental Effort Load
and Psychological Stress Load [38]. These three dimensions are combined to give a
measure of the total workload of a task.

The SWAT method divides each of the three dimensions into three
levels, where one indicates the lowest level and three the highest.
For example, a time load rating of one indicates a low level of time load, where the user
has a lot of time to perform the task, while a rating of three indicates a very high
level of time load, where the user has no spare time and has to deal with overlapping
activities.

The first step of the SWAT method is a card sorting process. Cards representing
all 27 combinations of levels for the three dimensions are sorted
and ordered from the combination that represents the lowest workload to the highest.
The lowest would logically be a rating of 1, 1 and 1 for time load, mental effort load
and psychological stress load respectively, while the highest would be 3, 3 and 3.
The steps in between would typically vary with users and tasks. The user
then performs a task and rates it on the three dimensions of workload. By seeing
where this rating falls in the order of the sorted cards, a weighted workload score
ranging from 0 to 100 can be calculated.

4.7.4 Rating Scale Mental Effort

The Rating Scale Mental Effort (RSME) method is a simple, one-dimensional subjective
scale for measuring the mental effort required for a task [49]. It is
simpler than many other mental workload measurement methods because
it only requires the user to answer a single scale question.

The scale is made up of a 15 cm long line with every 1 cm indicated. The line is
accompanied by verbal descriptors of the level of mental effort, such as "almost
no effort" and "extreme effort". The positions of the verbal descriptors along the scale
were carefully adjusted after many user tests during the initial development of
the RSME method [49].

In comparison to other mental workload measurement methods, the RSME lacks
some of the more complex aspects that make up the total workload: it does not
consider different dimensions of mental workload as NASA-TLX, DALI and SWAT
do [45].


4.8 System Usability Scale

The System Usability Scale (SUS) is a simple scale for subjective assessment
of a system’s usability [6]. The SUS is made to be quick and to allow users to
convey their experienced usability of a system they have just used.

The SUS is a Likert scale that uses ten 5-point scales ranging from strongly disagree
to strongly agree. The scales cover topics such as the complexity
of the system, the integration of functions, and whether the system was cumbersome to use.
A full example of the SUS including all scales can be seen in Figure 4.2.

Figure 4.2: A full example of a SUS [6]

The user’s responses go through a calculation process where the
scores are converted to lower values if they indicate bad usability or higher values
if they indicate good usability. This is done by subtracting 1 from the score of each
odd-numbered question and subtracting the score of each even-numbered question from 5.
After summing the converted scores and multiplying by 2.5, a final SUS score between 0
and 100 emerges.
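
As a minimal sketch of this calculation, the function below follows the standard SUS scoring, in which each odd-numbered question contributes its score minus 1 and each even-numbered question contributes 5 minus its score. The function name sus_score and the example responses are illustrative assumptions.

```python
def sus_score(responses):
    """Compute a SUS score (0-100) from ten responses on a 1-5 scale.

    responses[0] is question 1, responses[1] is question 2, and so on.
    """
    if len(responses) != 10:
        raise ValueError("SUS requires exactly ten responses")
    if any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("each response must be an integer from 1 to 5")
    total = 0
    for number, response in enumerate(responses, start=1):
        if number % 2 == 1:
            total += response - 1   # odd-numbered, positively worded questions
        else:
            total += 5 - response   # even-numbered, negatively worded questions
    return total * 2.5

# Hypothetical responses from one participant
example = [4, 2, 5, 1, 4, 2, 5, 1, 4, 2]
print(sus_score(example))
```

Each converted score lies between 0 and 4, so the sum lies between 0 and 40 and the factor 2.5 stretches it to the 0 to 100 range.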

4.9 Subjective Assessment of Speech System Interfaces

The Subjective Assessment of Speech System Interfaces (SASSI) method is a Likert-scale-based
questionnaire for subjective evaluation of speech system interfaces [18].
The SASSI consists of 34 scales related to the user’s experience with the
speech interface [24]. Each scale is a seven-point Likert scale. The scales are divided
into six topics: System Response Accuracy, Likeability, Cognitive Demand,
Annoyance, Habitability and Speed [24]. A full SASSI questionnaire can be seen in
Figure 4.3.

4.10 Eye Tracking

Eye tracking is the process of measuring eye movements in relation to different
points of fixation in the world. Eye tracking can be used as a measure of visual
attention. There are two types of eye tracking: automated and manual.

Automated eye tracking refers to all eye tracking technology that automatically
records and translates eye movements into data. Models specifically suitable for in-vehicle
eye tracking are remote eye trackers, which do not require the user’s head
to be locked in position. Makers of popular remote eye trackers include Tobii,
EyeTribe, and SMI [32]. While automated eye trackers offer advantages
such as increased precision, they also have limitations, such
as a smaller area of focus. Automated eye trackers may also produce data with
inconsistencies or insufficient accuracy and precision, which may require
manual review of the collected data [32]. Moreover, these eye trackers may demand
consistent lighting conditions and additional configuration, sometimes for each user
[32].

Manual eye tracking refers to tracking and measuring user visual attention through
the visual analysis of video recordings. This is done by having a researcher manually
code or annotate segments of a video recording according to relevant glance areas,
usually with the aid of annotation software. With respect to driver attention,
relevant glance areas may include parts of the road or areas of the vehicle’s interior.
The precision of this method is lower than that of automated eye tracking, but it
allows for examining larger areas of glance interest. Moreover, manual eye tracking
does not require advanced camera equipment or configuration. Cameras used in
manual eye tracking should have sufficient video quality to see the user’s eyes under
all expected light conditions. Depending on the desired time precision of the eye
tracking analysis, different video frame rates may be considered.

Automated and manual eye tracking each have their advantages and disadvantages.
Automated eye tracking is most suitable for situations where the glance area is
relatively small and high precision is required, for example when examining areas of
interest in the driver information module (DIM), the area behind the steering wheel.
For larger areas of glance interest, such as multiple areas within a vehicle, manual
eye tracking may be more suitable. Manual eye tracking also requires less setup,
but requires additional labor to manually code eye glances in the video footage.

4.11 Affinity Diagramming

Affinity diagramming is a method used for analyzing and structuring results from
research [27]. The results are structured so that themes emerge, allowing designers to
better understand and categorize data. This ultimately leads to a good understanding
of major problems or other important details [27].

The method begins with all participants writing down
relevant details gathered through research on notes. Each participant may use a
unique note color to make the notes easier to distinguish. The notes are
all put on a wall, and the participants then move them around, trying to sort
them into relevant groups and come up with group titles and even subgroup titles
if they feel the need.

A popular method for making affinity diagrams is the KJ method [27]. The KJ
method is done in a similar way to the description above but with a strong
emphasis on not talking while writing, placing and organizing the sticky
notes. Working in silence minimizes any possible influence of
group pressure [27].

4.12 Wizard of Oz

The Wizard of Oz (WOz) technique is performed by simulating a working prototype
or system by letting a researcher, or "wizard", operate and control the prototype
from behind the scenes [27]. Developing a fully working prototype is time- and
resource-intensive. The WOz technique allows researchers and designers to evaluate
a design concept without having to spend as many resources as building a fully
functional prototype would demand [27].


From the user’s or test participant’s perspective, WOz prototypes and implemented
features are indistinguishable. This is achieved by preparing system responses for
potential paths of interaction in advance, so that the prototype operator, or wizard,
can quickly respond to user input. For WOz prototypes to be successful, the pro-
totype operator must be able to see or hear the user so that appropriate responses
can be provided based on user input. Moreover, users should be unaware that the
prototype operator is controlling the WOz prototype.

The WOz technique has a long history in the development of speech recognition
and voice user interfaces [16]. WOz prototypes can be used throughout the design
process of voice interfaces and are an invaluable resource for understanding users’
vocabulary, utterance structures, and interactive patterns [16]. While WOz prototypes
allow voice interface designers to bypass developing speech recognition systems to
evaluate a design, the value of designed errors is not to be discounted. In fact,
there are tools for creating WOz prototypes that randomly assign speech recognition
errors to understand user reactions to such scenarios [23].
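
The combination of prepared responses and randomly assigned errors can be illustrated with a small sketch. It is not a reconstruction of the tools cited in [23]: the intent names, the response texts, the function wizard_respond and the error_rate parameter are all hypothetical.

```python
import random

# Hypothetical prepared responses; a real prototype would cover every planned path
PREPARED_RESPONSES = {
    "call_contact": "Calling Anna.",
    "play_genre": "Playing jazz.",
    "navigate": "Starting navigation to Main Street 1.",
}
ERROR_RESPONSE = "Sorry, I didn't catch that."

def wizard_respond(intent, error_rate=0.0, rng=random):
    """Return the prepared response for the intent the wizard selected.

    With probability error_rate, a simulated recognition error is returned instead.
    """
    if intent not in PREPARED_RESPONSES:
        raise KeyError(f"no prepared response for intent: {intent}")
    if rng.random() < error_rate:
        return ERROR_RESPONSE
    return PREPARED_RESPONSES[intent]

rng = random.Random(42)  # fixed seed so that a test session can be replayed
session = [wizard_respond("play_genre", error_rate=0.2, rng=rng) for _ in range(10)]
print(session)
```

Seeding the random generator is a design choice that lets the same error pattern be replayed for every participant, which keeps sessions comparable.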


Figure 4.3: The SASSI questionnaire [24]


Figure 4.4: Affinity diagram (partial) used to analyze qualitative data



5
Process

This chapter describes the research and design process carried out as part of this
thesis work. Several methods were used throughout the process,
and the specifics of those methods with respect to the purpose of this project are
discussed here. For details about the methods themselves, see Chapter 4: Methods.

5.1 Pre-study and Preparation

The pre-study phase was the first phase of the research project and focused on a
review of related research. This pre-study was done to understand what research
has already been done and which gaps in the research this project could
aim to fill.

In addition to developing a contextual understanding of the research area, the pre-
study helped to identify methods and test setups that are frequently used when ex-
amining distracted driving in terms of visual distraction and cognitive load. Methods
for data analysis and theories related to attention and cognition were also identified.
The knowledge gathered during the pre-study phase was used to plan the project
execution.

5.2 Project Planning

The project planning phase was focused on developing a schedule for the execution
of the project. The distribution of time between the pre-study and preparation,
project execution, and project finalization phases was based on recommendations
from thesis examiners at Chalmers University of Technology and spread over a 20-week
period. During the project planning phase, methods were selected for
their suitability to the research question developed in the pre-study phase. A full
Gantt schedule of the project process with calendar week numbers can be seen in
Appendix A.


5.3 Literature Review of Existing Guidelines

During the pre-study phase, a brief review of three existing design guidelines was
done to gain a general understanding of what each set of guidelines covered with
respect to voice assistant interaction in vehicles. The guidelines reviewed in the pre-
study stage were Android Auto Design Guidelines, Apple CarPlay Human Interface
Guidelines, and Google Conversation Design [2, 1, 14, 13]. These guidelines were
selected for review since they are directly tied to the two commercially available
integration interfaces.

To answer the first sub-question of the research question and understand what current
guidelines exist for voice assistant interaction in vehicles, a more in-depth literature
review of the existing guidelines was required. This literature review aimed
to summarize and understand the collective wisdom of the industry when it comes
to in-vehicle voice assistant interaction. In addition to the guidelines reviewed
during the pre-study phase, this literature review also included the Amazon Alexa Design
Guide [1]. Although Amazon does not have an integration interface on the
market, it has announced plans to release one in the coming years. A review of existing
NHTSA guidelines was also done, as those guidelines specifically deal with traffic
safety [30, 31].

The results of this literature review would be used in later phases to identify estab-
lished guidelines which work well to decrease visual distraction and cognitive load.
The review was also used to identify areas where the guidelines were not followed
by existing voice assistants and to identify gaps in the guidelines with respect to
distracted driving and voice assistant interaction.

5.4 Summative Evaluation

In order to understand the efficacy of the existing guidelines, a summative evaluation
of existing voice assistants in vehicles was conducted. The summative evaluation
also served to identify any issues in the voice assistant integrated IVI that may
contribute to visual distraction and increased cognitive load, thereby decreasing safe
driving. The evaluation consisted of three parts: an on-road test, data collection
and handling, and analysis of data collected from the test.

A total of 8 test participants completed the on-road test. A ninth participant began
an on-road test, but the test was ended prematurely due to concerns for traffic safety.

The 8 test participants had been licensed drivers for a mean time of 12.6 years.
Frequency of driving was evenly spread among participants, from driving
every day to less than once a month. Only two participants had previous experience
of driving with PA, the Level 2 ADS used in the test. All but one participant had
previous experience with VAs, and the large majority of this previous experience
was with VAs on smartphones, a screen-first solution.

5.4.1 On-road Test Setup

The on-road test was done on public roads in Torslanda, Gothenburg. Test participants
drove along a predefined route that measured 10.3 kilometers with roundabouts
at each end, which made for a continuous driving experience. Speed limits
along the route varied between 50 and 70 kilometers per hour.

Figure 5.1: Interior of the test car model, equipped with Android Auto and Apple
CarPlay

Test participants were recruited internally at Volvo but were not limited to employees.
Students and consultants placed at Volvo were also invited. Due to liability
issues with the test car, only employees, students, and consultants with Volvo access
could participate. The test car used was a Volvo V90 with automatic transmission
and Pilot Assist (PA), a lane keeping and adaptive cruise control feature which
makes it an SAE Level 2 autonomous vehicle. The implementation of the PA feature
on the test car is similar to that in other Level 2 vehicles. The V90 test car was also
equipped with both Android Auto and Apple CarPlay.

The on-road test was designed to compare the performance of the two voice assistants,
Apple Siri and Google Assistant. The test was also designed to determine if
there was a decrease in visual distraction and cognitive load when test participants
were aided by Pilot Assist. Thus, there were four conditions for test participants to
complete:

• Android Auto with manual driving

• Android Auto with Pilot Assist

• Apple CarPlay with manual driving

• Apple CarPlay with Pilot Assist

Under each condition, test participants were asked to perform 9 secondary tasks
while driving. These tasks were selected due to their relation to the categories of apps
enabled on integration interfaces. Moreover, functionality for all tasks exists on both
voice assistants tested. The tasks were:

1. Open a new received text message

2. Send text message to a contact

3. Make a call to a contact

4. Make a call to a contact with multiple phone numbers

5. Play a genre of music

6. Play a specific song by a specific artist

7. Start navigation to a street address

8. Add a café to the current route

9. Start navigation to the nearest McDonald’s

For each test, participants began by signing a consent form for their data to be
collected and used for this project’s research. Next, they completed a survey about
their previous experience with driving and using voice assistants. On the drive from
Volvo Headquarters to the designated test route, test participants were trained on
using PA and had a chance to get familiar with driving the car, with and without
PA. Then, the test participants performed the 9 secondary tasks while driving along
the test route for each of the four test conditions. The order of test conditions was
randomized to minimize any bias from the order in which the test conditions were completed.
Prior to starting each test condition, test participants were given training on the
voice assistant for that condition in relation to the types of tasks they would be
asked to perform. The order of the 9 tasks was not randomized, since some tasks
built upon the output of a previous task and the overall difference between voice assistants
was the focus of each test condition. After completing each driving condition, test
participants were asked to complete a DALI survey to assess the cognitive load
of that test condition. After the test, participants were asked a set of follow-up
questions about their overall experience in a semi-structured interview.
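
One way to randomize the order of the four test conditions per participant is sketched below. This is an assumption about how such a randomization could be done, not the procedure that produced the permutations in Appendix D. The function name assign_condition_orders, the participant numbering and the seed value are hypothetical.

```python
import random

CONDITIONS = [
    "Android Auto with manual driving",
    "Android Auto with Pilot Assist",
    "Apple CarPlay with manual driving",
    "Apple CarPlay with Pilot Assist",
]

def assign_condition_orders(participant_ids, seed=None):
    """Give each participant an independently shuffled order of the four conditions."""
    rng = random.Random(seed)
    orders = {}
    for pid in participant_ids:
        order = CONDITIONS[:]  # copy so that the master list stays untouched
        rng.shuffle(order)
        orders[pid] = order
    return orders

# Hypothetical participant numbers 1 to 8 and an arbitrary seed
orders = assign_condition_orders(range(1, 9), seed=2019)
for pid, order in orders.items():
    print(pid, order)
```

Independent shuffling does not guarantee that every condition appears equally often in each position. With a small number of participants, a counterbalanced scheme such as a Latin square is a common alternative.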

For each task, participants were permitted up to 3 attempts in the case of task
failure. Task failure is defined as the end of an interaction with the VA that does
not trigger the desired intent or action. Task success is defined as the successful
completion of a task using the voice assistant. For example, an utterance for Task 7
which results in navigation to the wrong address would be considered a task failure.
Test participants were not required to make repeated attempts in the case of task
failure.

The survey about the participant’s previous experience can be found in Appendix
B. A complete protocol of the on-road test can be found in Appendix C. The test
schedule and the randomized condition permutations can be found in Appendix D.

5.4.2 Data Collection and Handling

Visual distraction during the on-road test was measured by manual eye tracking,
which is further described in section 4.10. The primary focus of this data was to
distinguish between on-road and off-road glances. Moreover, cognitive load was
assessed using DALI surveys completed during the on-road test. The DALI survey
used can be found in Appendix I.

Eye glance data was collected during the on-road test via three video cameras
mounted throughout the car. The cameras recorded a view of the driver’s face,
the IVI display, and the road. The three camera views can be seen in Figure
5.2. Audio was included in the video recordings. These videos were then synchronized
for each condition. The synchronized videos made it possible to code the
eye glances of the test participant and understand the context of glances with the
added road and IVI views.

The synchronized video was then manually coded, one task at a time, using a custom tool
created in Matlab. The software used is seen in Figure 5.2. Eye glance
analysis for a task ran from the end of the test facilitator’s prompt to complete the task
until task success or the end of the last attempt to complete the task. Thus, the task footage
analyzed may include more than one attempt to complete a task. The tool allowed
the video to be analyzed at a frequency of 30 Hz and made it possible to assign a
glance code to each frame analyzed. Once the glance codes were assigned, the duration
of each glance was calculated in preparation for data analysis.

Figure 5.2: Eye tracking software and video with three camera views

Several codes were used to annotate the eye glance data. These codes were:

0. On-road: Glances on the road

1. IVI VA Inactive: Glances at the IVI when the VA is not active

2. IVI VA Active: Glances at the IVI when the VA is actively awaiting a driver utterance

3. IVI VA Processing: Glances at the IVI when the VA is processing an utterance

4. IVI VA Response: Glances at the IVI when the VA is presenting a response or prompt

5. DIM PA Off: Glances at the driver information module (DIM) when PA is off

6. DIM PA On: Glances at the DIM when PA is on

7. Miscellaneous: Glances that are not directed at the road, IVI, or DIM
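
The step from per-frame glance codes to glance durations can be sketched as a run-length calculation. The snippet below is written in Python for illustration and is not the Matlab tool used in this project. The function names and the example frames are hypothetical, and treating every code other than 0 as off-road is an assumption.

```python
from itertools import groupby

FRAME_RATE_HZ = 30  # analysis frequency stated in the text
ON_ROAD = 0         # glance code 0; all other codes are treated as off-road here

def glance_durations(frame_codes, frame_rate_hz=FRAME_RATE_HZ):
    """Collapse per-frame glance codes into (code, duration in seconds) glances."""
    return [
        (code, sum(1 for _ in run) / frame_rate_hz)
        for code, run in groupby(frame_codes)
    ]

def off_road_time(glances):
    """Total time in seconds spent on glances away from the road."""
    return sum(duration for code, duration in glances if code != ON_ROAD)

# Hypothetical coded frames: on-road, VA active, on-road, VA response
frames = [0] * 60 + [2] * 15 + [0] * 30 + [4] * 45
glances = glance_durations(frames)
print(glances)
print(off_road_time(glances))
```

At 30 Hz each frame covers about 33 ms, which is the finest glance duration this kind of coding can resolve.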

DALI data was collected for each test participant and each test condition, for a total of
4 completed DALIs per test participant. Adjusted ratings from each DALI were
calculated for the individual dimensions of the DALI. Combined, the adjusted ratings
resulted in a weighted rating also used in later data analysis. DALI scores were
weighted according to the established protocol described in Chapter 4.

In addition to the quantitative data collected above, qualitative observational data
was also collected. The synchronized videos were reviewed and qualitative observa-
tions, such as emotional reactions and scenarios of high frustration, were recorded.
Test participant answers from the debrief interview were also transcribed for quali-
tative analysis.


5.4.3 Data Analysis

The data analysis was done in two parts: qualitative and quantitative. The qual-
itative analysis deals with observational notes of the on-road test and transcribed
answers from the debrief interview. The quantitative analysis concerns the eye
glance and DALI data.

Qualitative data from observation notes and interview transcriptions were combined
and analyzed using the affinity diagramming method. This allowed for connections
between different data points and recurring themes in the data to be identified.
The insights from this analysis would later guide the development of new design
guidelines and actualizations of these guidelines as prototypes.

Figure 5.3: Affinity diagram (partial) of qualitative summative evaluation data

The quantitative data was analyzed using Minitab, a statistical data analysis software
package. Eye glance and DALI data were plotted in order to identify any trends in
the data. The plots were also used to help determine whether the different VAs
had a discernible effect on visual distraction and cognitive load, and
whether the use of SAE Level 2 ADS, in this case PA, had
an effect on visual distraction or cognitive load. The results from the quantitative
data analysis were also used to support and motivate new design guidelines for voice
assistant interaction in vehicles.


5.5 Prototype Development and Evaluation

Following the summative evaluation, ideation began on improvements to address the issues
it had identified. The ideation process resulted in
two prototypes, Prototype 1 and 2. These prototypes embodied interstitial, new
guidelines for voice assistant interaction in vehicles. The prototypes were then
tested in a simulator, and the results from the simulator test were collected and
analyzed.

Both prototypes were tested by 10 participants, but eye glance data is only available
for 9 participants due to file corruption. The participants had been licensed drivers
for a mean time of 9.1 years. Participants were mostly infrequent drivers, driving
every other week or less. Three test participants drove at least once a week. All but
two participants had previous experience with VAs. The majority of participants
had previously experienced VAs on smartphones.

5.5.1 Ideation and Prototype Development

The aim of the ideation process was to come up with potential solutions to the
problems in the existing guidelines and voice assistants. The ideation process
began by narrowing the number of tasks that the test participant would perform
during the simulator test. Tasks from the on-road test were reused, but tasks that
involved little interaction and yielded little data, such as calling a contact by
name, were removed. A voice-first approach was taken: the ideation process focused
first on generating many conversation dialogs for the test tasks. Eventually, two
concepts emerged, labeled Prototype 1 and Prototype 2. Conversation dialogs for
both prototypes were further refined to reflect the two concepts. This included
multi-turn and one-shot dialogs for applicable tasks. Moreover, since errors were
found to be a factor in visual distraction and cognitive load in the summative
evaluation, error handling was a key redesigned element in both prototypes.
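To make the distinction between one-shot and multi-turn dialogs concrete, the
sketch below models a dialog as a set of required pieces of information (slots).
In a one-shot dialog the user supplies everything in a single utterance; in a
multi-turn dialog the system asks for what is still missing, and re-prompts when
nothing was understood. The message-sending task, the slot names and the prompt
wording are hypothetical and are not taken from the prototypes.

```python
from dataclasses import dataclass, field


@dataclass
class DialogState:
    required_slots: tuple                       # information the task needs
    filled: dict = field(default_factory=dict)  # information gathered so far
    errors: int = 0                             # turns where nothing was understood


def next_prompt(state, heard):
    """Record the slots understood in this turn and return the next system turn."""
    state.filled.update(heard)
    missing = [s for s in state.required_slots if s not in state.filled]
    if not missing:
        return "OK."
    if not heard:
        # Error handling: acknowledge the failure and repeat the open question.
        state.errors += 1
        return f"Sorry, I did not catch that. What is the {missing[0]}?"
    return f"What is the {missing[0]}?"


# One-shot: every slot arrives in the first utterance.
one_shot = DialogState(("recipient", "message"))
one_shot_reply = next_prompt(one_shot, {"recipient": "Alex", "message": "On my way"})

# Multi-turn: the system asks for the missing slot, including after an error.
multi = DialogState(("recipient", "message"))
multi_replies = [
    next_prompt(multi, {"recipient": "Alex"}),
    next_prompt(multi, {}),
    next_prompt(multi, {"message": "On my way"}),
]
```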

Both prototypes were then implemented in Adobe XD, as shown in Figure 5.4, which
allows designers to assign a pre-written system response for each screen. Adobe XD
is able to output both visual information as well as speech. The prototypes were
developed with high-fidelity graphics for the purposes of using the Wizard of Oz
(WOz) technique when testing the prototypes. The two prototypes shared a visual
language, in order to focus on discerning any differences or preferences between the
two voice interaction concepts developed.

Figure 5.4: Prototype development in Adobe XD

In order to later use the WOz technique with the prototypes, a control panel was
designed for both prototypes. The control panel allows the wizard to control the
flow of system responses to test participant input. During the simulator test, the
control panel would be hidden from the test participant.
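The role of the control panel can be sketched as a lookup from wizard actions to
pre-written system responses. The following is a minimal Python analogy, not the
Adobe XD prototype itself and not an Adobe XD API; the class name, the button
labels and the response texts are hypothetical.

```python
class WizardPanel:
    """Maps each wizard button to one pre-written system response."""

    def __init__(self, responses):
        self.responses = dict(responses)
        self.log = []  # order of buttons pressed, useful for later analysis

    def trigger(self, button):
        """Return the response for a button press and record the press."""
        if button not in self.responses:
            raise KeyError(f"No pre-written response for button: {button}")
        self.log.append(button)
        return self.responses[button]


# Hypothetical buttons and responses, for illustration only.
panel = WizardPanel({
    "ask_destination": "Where would you like to go?",
    "confirm": "OK, starting navigation.",
    "not_understood": "Sorry, I did not catch that.",
})

first_reply = panel.trigger("ask_destination")
second_reply = panel.trigger("not_understood")
```

Because every response is written in advance, the wizard only chooses which one
to play, which keeps the system's behavior consistent across participants.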

5.5.2 Simulator Test Setup

The two prototypes were tested in a truck simulator located at the Chalmers
Johanneberg campus. Test participants drove along a highway, following in-game
navigation directions in the truck-driving simulation game Euro Truck Simulator 2.
The game included multiple lanes of traffic, which participants were free to switch
between while avoiding collisions with the other in-game vehicles.

Test participants were recruited through an online survey. Participants were
required to hold a valid driver’s license. Upon completing the test, participants
were compensated with a gift card worth 250 kronor.

The simulator setup included a large TV display positioned in front of the driver’s
seat. The seat was a fully adjustable car seat, which helped experienced drivers
acclimate to the simulator. The simulator was also equipped with a steering wheel,
gear shift, and pedal game controls to drive the truck. The setup can be seen in
Figure 5.5. The steering wheel provided some force feedback to simulate bumps in
the road, and the simulator was set to an automatic transmission. The simulator
was not equipped with any autonomous driving features.

Figure 5.5: Simulator test setup with a dividing wall between the test participant
and wizard (not to scale)

A Microsoft Surface Book was mounted to the right of the steering wheel to simulate
the IVI. The simulated IVI displayed the two tested prototypes, one at a time, and
was connected to the wizard computer by remote desktop. This allowed the wizard
to control the prototype’s responses to test participant input on the fly. The wizard
was situated in the same room as the test participant and test facilitator, but behind
a partition so participants were not aware that the wizard was controlling the IVI
prototype. Only one person acted as the wizard to minimize systematic bias. As
shown in Figure 5.5, the control panel of the IVI prototype was hidden from the test
participant but visible to the wizard. This control panel on the prototypes allowed
the wizard to remotely control the prototypes in real time, in dir