Design and implementation of a fault-tolerant
drive-by-wire system

Master of Science Thesis in Embedded Electronics System Design

Alexander Altby

Davor Majdandzic

Department of Computer Science and Engineering
Chalmers University of Technology
Gothenburg, Sweden 2014



The Author grants to Chalmers University of Technology the non-exclusive right to publish
the Work electronically and, for a non-commercial purpose, make it accessible on the Internet. The
Author warrants that he/she is the author of the Work, and warrants that the Work does not
contain text, pictures or other material that violates copyright law.

The Author shall, when transferring the rights of the Work to a third party (for example a
publisher or a company), inform the third party of this agreement. If the Author has
signed a copyright agreement with a third party regarding the Work, the Author hereby
warrants that he/she has obtained any necessary permission from this third party to let Chalmers
University of Technology store the Work electronically and make it accessible on the
Internet.

Alexander Altby, Davor Majdandzic.

© Alexander Altby, June 2014.

© Davor Majdandzic, June 2014.

Examiner: Johan Karlsson.

Chalmers University of Technology
Department of Computer Science and Engineering
SE-412 96 Gothenburg
Sweden
Telephone + 46(0)31-772 1000

Department of Computer Science and Engineering
Gothenburg, Sweden June 2014


ABSTRACT

Abstract

This thesis presents the design and implementation of a prototype for a drive-by-wire
system in road vehicles. The prototype extends an existing (non-fault-tolerant) prototype
with fault tolerance by implementing a distributed brake system and dual modular redundancy
for the central control unit. The steering is made redundant by utilizing the distributed
brakes: braking on one side of the car causes the car to
turn in the corresponding direction, i.e., steer-by-brakes. A hardware monitor is designed and
implemented in the redundant central control units. The purpose of the hardware monitor
is to restart the control unit in case of a failure.

The non-fault-tolerant prototype is used as a reference design when analysing the
reliability and safety of the fault-tolerant design. An analysis is made to verify the lowest
failure rate that the design must tolerate in order to meet a target reliability of 99.999%
after 10 years. The thesis follows the guidelines of the standard for functional safety in road
vehicles, ISO 26262.



ACKNOWLEDGMENTS

Acknowledgements

Special thanks to Behrooz Sangchoolie, PhD student in the fault tolerance area at Chalmers
University, for his technical support, help with the report, continuous feedback and
technical expertise in the area of fault tolerance.

Also, thanks to Johan Karlsson, professor in dependable real-time systems at Chalmers
University, for assisting with his technical expertise in the area of fault tolerance and for being the examiner
for this thesis.

We would also like to thank the participants of the reading group on ISO 26262:
Behrooz Sangchoolie, Pierre Kleberger, Fatemeh Ayatolahi and Aljoscha Lautenbach. Your
technical expertise and insight have raised our interest in the area of fault tolerance.
You have been a large influence on the outcome of this thesis.

Last but not least, thanks to David Rydén at Sigma Technology for accepting the proposed thesis
and providing industrial expertise in the automotive area. Also, thanks to Alexandra Angerd
at Sigma Technology for her assistance and for introducing us to the previously developed system.

Alexander Altby & Davor Majdandzic, Gothenburg June 2014.



ABBREVIATIONS

ABBREVIATIONS

ADC Analog to Digital Converter

ABS Anti-lock Brake System

ASIC Application-Specific Integrated Circuit

ASIL Automotive Safety Integrity Level

CAN Controller Area Network

CCU Central Control Unit

CPLD Complex Programmable Logic Device

CPU Central Processing Unit

GPIO General-Purpose Input/Output

E/E Electrical/Electronic

ECU Electronic Control Unit

EMI ElectroMagnetic Interference

ESP Electronic Stability Program

FFA Functional Failure Analysis

HARA Hazard Analysis and Risk Assessment

HW Hardware

JTAG Joint Test Action Group

LED Light Emitting Diode

LCD Liquid Crystal Display

ODE Ordinary Differential Equation

SIL Safety Integrity Level

SPI Serial Peripheral Interface

SW Software

SWIFI Software Implemented Fault Injection



CONTENTS

Contents

1 Introduction

2 Existing prototype

3 Technical background
3.1 Dependability
3.2 Faults, errors and hazards
3.3 Degree of operation
3.4 Integrity level
3.5 High Reliability
3.6 Single-point and multiple-point faults
3.7 Techniques for fault tolerance
3.8 Fault injection
3.9 Architecture designs
3.9.1 Local
3.9.2 Centralized
3.9.3 Distributed

4 Concept phase
4.1 Item definition
4.1.1 Functional concept
4.1.2 Usage in different environments
4.2 Hazard analysis and risk assessment
4.2.1 Hazard identification and classification
4.2.2 Safety goal
4.3 Functional safety concept
4.4 Modifications to the reference design
4.4.1 Distributed brake system
4.4.2 Steer-by-brakes
4.4.3 Impact analysis

5 Product development
5.1 System design
5.2 Hardware design
5.2.1 Duplex-modular redundancy for central control unit
5.2.2 Error detection
5.2.3 Reset handling
5.3 Software design

6 Safety validation
6.1 Overview
6.2 System reliability
6.2.1 Central control unit
6.2.2 Brake-and-steer system
6.3 System safety

7 Implementation
7.1 Software
7.1.1 Tasks
7.1.2 Voting
7.1.3 Recapture of steering value
7.1.4 Determination of a failed central control unit
7.2 Hardware
7.2.1 Primary and backup central control unit
7.2.2 Hardware monitor
7.2.3 Development platforms
7.2.4 Electronic control unit for the distributed-brake

8 Timing analysis
8.1 Existing prototype
8.2 Fault-tolerant prototype

9 Test and verification
9.1 Software implemented fault injections
9.1.1 Testing the functionality of hardware monitor
9.1.2 Testing the functionality of majority voter
9.1.3 Testing the functionality of median voter
9.2 Pin fault injection

10 Discussion

11 Conclusion and future work


LIST OF FIGURES

List of Figures

2.1 System overview of the non-fault-tolerant drive-by-wire system
2.2 Block diagram of the non-fault-tolerant drive-by-wire system
2.3 Block diagram of the reference drive-by-wire system
3.1 Flowchart showing how fault propagates
3.2 Local architecture
3.3 Centralized architecture
3.4 Distributed architecture
4.1 Overview of the safety life cycle for the concept phase
4.2 Use case of the items
4.3 The interaction between the elements of the items
4.4 Safety requirements
4.5 Use case of the modified items
4.6 The modified items’ interaction with the different subsystem
5.1 Overview of the safety life cycle for the product development
5.2 Centralized architecture for the drive-by-wire used in the reference design
5.3 Distributed architecture used in the fault-tolerant design
5.4 Distributed-reset design for the central control unit
5.5 Self-reset design for the central control unit using a hardware monitor
5.6 Software design of the safety mechanisms in the central control unit
6.1 Fault tree of the reference system
6.2 Fault tree of the fault-tolerant system
6.3 Markov model of the central control unit for the reference system
6.4 Markov model of the central control units for the fault-tolerant system
6.5 Reliability for the central control unit of the previous and fault-tolerant system
6.6 Reliability for the fault-tolerant system’s central control unit using ideal coverage
6.7 Reliability for the fault-tolerant system’s central control unit using a coverage of 99%
6.8 Reliability for the fault-tolerant system’s central control unit using a coverage of 98%
6.9 Reliability for the fault-tolerant system’s central control unit using a coverage of 95%
6.10 Markov model of the brake-and-steer system for the reference design
6.11 Markov model of the brake-and-steer system for the fault-tolerant design
6.12 Reliability for the brake-and-steer system of the reference and fault-tolerant system
6.13 Intersection points of the systems
6.14 Reliability of the brake-and-steer system for the fault-tolerant design using ideal coverage
6.15 Reliability of the brake-and-steer system for the fault-tolerant design using a coverage of 99%
6.16 Reliability of the brake-and-steer system for the fault-tolerant design using a coverage of 97%
6.17 Markov model with safe state for the throttle system
6.18 Markov model with safe state for the CCU system
6.19 Markov model with safe state for the brake-and-steer system
7.1 Overview of the fault-tolerant prototype
7.2 Flow chart of the tasks in the fault-tolerant prototype
7.3 Overview of the functionality for throttle-by-wire and brake-by-wire
7.4 Messages on CAN bus to recapture steering value
7.5 Messages on CAN bus to recapture steering value when both CCUs startup at the same time
7.6 Hardware implementation of the fault-tolerant system
7.7 Picture of the implemented main central control unit
7.8 Picture of the backup central control unit
7.9 Picture of the hardware monitor
7.10 State diagram of the hardware monitor
7.11 State transition of the hardware monitor when a glitch occurs
7.12 State transition of the hardware monitor when a transient fault occurs
7.13 State transition of the hardware monitor when a permanent fault occurs
7.14 Function of the hardware monitor when time between failures in the central control unit is 40 ms
7.15 Function of the hardware monitor when time between failures in the central control unit is 60 ms
7.16 Function of the hardware monitor when time between failures in the central control unit is 80 ms
7.17 Function of the hardware monitor when time between failures in the central control unit is 100 ms
7.18 Picture showing the prototype of the brake ECU
8.1 Run time for the steering task in the existing prototype
8.2 Run time for the throttle task, brake task and CAN task, in the existing prototype
8.3 Overview of the run times for all periodic tasks in the existing prototype
8.4 Overview of the run time for the Brake (or Throttle) task in the fault-tolerant prototype
8.5 Detailed run time for one sensor reading for the brake and throttle tasks in the fault-tolerant prototype
8.6 Run time for the CAN_RX (CAN receive) task in the fault-tolerant prototype
8.7 Run time for the CAN_TX (CAN transmit) task in the fault-tolerant prototype
8.8 Run time for the watchdog task in the fault-tolerant prototype
8.9 Run time for all tasks, in the fault-tolerant prototype
9.1 Overview of the functional testing of the fault-tolerant prototype


LIST OF TABLES

List of Tables

3.1 Required failure rate according to the Safety Integrity Level
4.1 The items of the drive-by-wire system and description of their functionality
4.2 Environmental impact
4.3 Description of the injury impact
4.4 Description of the probability of exposure
4.5 Description of the controllability
4.6 Hazard identification and ASIL determination
4.7 ASIL determination
5.1 Mechanisms for error detection at software level
6.1 Example values for transient failure rates for different ASIL
6.2 Example values for random hardware failures
7.1 Id for the different CAN messages
8.1 An overview of the run time for the tasks in the existing prototype
8.2 Overview of the run time for the tasks in the fault-tolerant prototype


1 INTRODUCTION

1 Introduction

One of the next big developments in the vehicle industry is self-driving, or autonomous, cars. An
autonomous car requires that the actuators that control the motion of the vehicle can be
operated electronically. Therefore, a drive-by-wire system is needed. A drive-by-wire system replaces
the mechanical systems in a traditional vehicle with electrical/electronic (E/E) systems that
perform fundamental vehicle functions.

The drive-by-wire system includes steer-by-wire, brake-by-wire and throttle-by-wire. The "by-
wire" expression means that the information, from the sensor to the actuator of the different
systems, is transferred electronically through wires and not by traditional hydraulic systems or
mechanically through struts or shafts.

The advantage of using drive-by-wire rather than mechanical systems is that cost,
weight and the number of moving parts can be reduced. Since the steering rack can be removed, the car’s
behaviour on impact in a collision can be improved. Using an electrically based system will also
increase the information flow and ease the interconnection between different components in the
car, facilitating the use of safety functions such as ABS (anti-lock brake system) and ESP (electronic
stability programme).

Electromagnetic interference (EMI) [1] and ionizing radiation [2][3] are two examples of error
sources that may cause failures in an integrated circuit in an E/E system. The systems
that constitute the drive-by-wire system must be able to handle such failures
in a predictable way, since losing control of a car at high speed may have a
lethal outcome. Therefore, the drive-by-wire system needs to be as fault tolerant as reasonably
possible.

Safety can be achieved by introducing safety mechanisms in vital components of the drive-by-
wire system. The safety mechanisms have to target faults that can put the drive-by-wire
system in a state which may have a lethal outcome for the driver or other road users. A standard
for functional safety in road vehicles, ISO 26262 [4], was released in November 2011 to provide the
vehicle industry with guidelines on how to develop safety-critical applications in a vehicle.

This thesis deals with the development of a fault-tolerant architectural design for a drive-by-wire
system. By following the guidelines of the ISO 26262 standard, this thesis identifies possible failure
sources in the drive-by-wire system and how to deal with them. A safety analysis is made to
ensure that the reliability and safety of the designed system are improved.

An implementation of a prototype for the fault-tolerant design is also made to demonstrate the
feasibility of the proposed architecture. The prototype was implemented by extending an existing
non-fault-tolerant prototype developed in the thesis by Angerd and Johansson [5]. The design
of the non-fault-tolerant prototype is used as a reference when evaluating the reliability and safety
improvements of the fault-tolerant design.

This thesis makes the following contributions:

• Proposition of a fault-tolerant architectural design. The design consists of a distributed
brake system with steer-by-brake functionality and redundancy in vital units.

• Reliability analysis showing how a required reliability can be achieved for the fault-tolerant
design. The analysis proposes a Markov model for the design and compares it to a non-
fault-tolerant design. Further, a safety analysis is proposed showing the steady-state safety
of the fault-tolerant design.

• Realization of a prototype to show the feasibility of the proposed architecture. The prototype
shows the key components contributing to the fault tolerance in the architecture.
The key components are the hardware monitor, the fault detection mechanisms and the
failure avoidance mechanisms.

The content of this thesis is outlined as follows: Section 2 explains the non-fault-tolerant drive-
by-wire system made by earlier thesis students. Section 3 describes the background and theory
of the different techniques used in this thesis. Section 4 is the concept phase, in which the items
of the drive-by-wire system are defined. Certain risks and hazards are also identified in section
4. The methods which are to be applied to the items, in order to prevent the risks and hazards
from occurring, are presented in section 5. Section 6 shows that the safety strategies applied to
the fault-tolerant design increase the reliability and safety compared to the reference design.
Section 7 presents how the safety mechanisms and methods are implemented. A timing analysis is
presented in section 8 in order to show how the implemented methods and mechanisms affect
the execution time of the software. In section 9, the safety mechanisms are tested and verified.
Finally, a discussion and conclusion are given in sections 10 and 11.



2 EXISTING PROTOTYPE

2 Existing prototype

The fault-tolerant drive-by-wire prototype is based on an earlier master’s thesis project made
by Angerd and Johansson [5]. The existing prototype consists of a central control unit (CCU),
ECUs for the steering, brake and throttle systems and sensors for the driver input. An overview
of the non-fault-tolerant design is shown in Figure 2.1.


Figure 2.1: System overview of the non-fault-tolerant drive-by-wire system. The steering wheel
and the pedals for brake and throttle constitute the system’s sensors.

A block diagram of the non-fault-tolerant drive-by-wire system is shown in Figure 2.2. The three sensors
for throttle, brake and steering are connected to the central control unit (CCU). The CCU is
connected to two electronic control units (ECUs) via a controller area network (CAN) bus, one
ECU for throttle and brake and another one for steering. The two ECUs are connected to the
actuators controlling the rear-wheel drive and steering.


Figure 2.2: Block diagram of the non-fault-tolerant drive-by-wire system made by previous thesis
students.


The CCU is implemented on a Texas Instruments Hercules development board, TMS570LS3137
[6]. It runs FreeRTOS [7], an open-source real-time kernel.
The sensors for brake and throttle are connected to two AD converters on the development board
of the CCU. The steering wheel is connected to two general-purpose I/O pins on the development
board. The development board also has a CAN interface, which enables CAN communication
from the CCU to the ECUs.

The ECU for brake/throttle is implemented on an 8-bit Atmega [8]. Since the Atmega does not
have a CAN interface, a separate SPI-to-CAN circuit is used. A servo and an LED array are used
to represent the brake value. An LED array and an electric motor connected through an H-bridge
to the CPU represent the throttle value. An LCD is also connected to the CPU and
displays the current brake and throttle values. A similar ECU, without the LED arrays,
H-bridge and electric motor, is used for steering [5].

To make a fair comparison between the fault-tolerant design and the design of the existing prototype,
a reference drive-by-wire system is defined. This non-fault-tolerant system, hereinafter referred
to as the reference system, is shown in Figure 2.3.


Figure 2.3: Block diagram of the reference drive-by-wire system with no fault tolerance.

Compared to the previous system, shown in Figure 2.2, the reference system has separate
ECUs for brake and throttle.



3 TECHNICAL BACKGROUND

3 Technical background

This section explains the concept of fault tolerance. Common attributes that are used to measure
a system’s ability to avoid failures are stated and explained. The concept of a hazardous event
and how a system can propagate to such a state is further explained. Different techniques of
fault injection, used to test the robustness of a system, are also explained. The definition of, and
difference between, single-point faults and multiple-point faults are described. Different techniques
for increasing safety, such as voting, redundancy, fault avoidance and forecasting, are
elaborated. Local, centralized and distributed architecture designs are also
explained in this section.

3.1 Dependability

Dependability is the system’s ability to avoid failures. According to [9][10], dependability
is described using the following four attributes:

• Availability - Readiness for usage. The probability that the system will work as intended
at any given time.

• Reliability - Continuity of service without failure. The probability that the system will
work as intended throughout a specified period of time.

• Safety - The probability of avoiding a catastrophic event.

• Maintainability - Ability to handle repairs, modifications and updates.

In a drive-by-wire system all of these can be of importance. Since a poor reliability and safety
in the drive-by-wire system can harm the driver, these are the highlighted attributes in this
thesis.
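
As a rough illustration of the reliability attribute, the sketch below assumes a single unit with a constant failure rate λ, so that R(t) = exp(−λt). The exponential model, the function names and the hours-per-year constant are assumptions made for illustration; they are not the Markov models used later in this thesis. Under that assumption, the target of 99.999% after 10 years stated in the abstract bounds the failure rate as follows:

```python
import math

HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year, continuous operation


def reliability(failure_rate_per_year: float, years: float) -> float:
    """R(t) = exp(-lambda * t) for a constant failure rate (assumed model)."""
    return math.exp(-failure_rate_per_year * years)


def max_failure_rate(target_reliability: float, years: float) -> float:
    """Largest constant failure rate (per year) that still meets the target."""
    return -math.log(target_reliability) / years


if __name__ == "__main__":
    rate = max_failure_rate(0.99999, 10.0)
    print(f"max failure rate: {rate:.3e} per year")
    print(f"                  {rate / HOURS_PER_YEAR:.3e} per hour")
    print(f"check R(10 years) = {reliability(rate, 10.0):.6f}")
```

Under these assumptions the bound is roughly 10^-6 failures per year for a unit without redundancy.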

3.2 Faults, errors and hazards

A hazard is the occurrence of an event that puts people at risk [9]. An example of a
hazard is losing a wheel of the vehicle while driving. This event puts the driver, pedestrians or
other road users at risk of injury. A hazard occurs when a fault propagates to an error
that is not covered by the safety mechanisms in a system.

A fault is either a random event (e.g. a bit-flip or a frozen memory bit) or a systematic event
(e.g. a bug in the code) in a system or subsystem. Based on their duration, faults can be categorized into
three groups: transient faults, intermittent faults and permanent faults [9]. Transient faults are
faults that appear and disappear within a short period of time. An example of a transient fault is a
temporary random bit-flip in the memory. Intermittent faults are faults that appear sporadically.
The source of an intermittent fault is therefore usually hard to discover. Intermittent faults can
be caused by cold solder joints or loose connectors. Permanent faults are faults that,
once they occur, remain indefinitely [9]. Electromigration [11] and design faults are two
examples of permanent fault sources.

An error occurs when a fault propagates and changes the output of the system in such a way that it
behaves in an unwanted manner [12]. An example is a bit flip in a memory
cell that stores the current steering value. If there is no safety mechanism to detect that fault,
the fault will result in an error when the steering value is sent to the electronic control unit
(ECU) for steering.
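
The bit-flip example can be made concrete with a short sketch. It assumes a 16-bit steering value and protects it by storing a bitwise-inverted copy next to it. Both the word width and the duplication scheme are assumptions chosen for illustration, not the safety mechanism used in the prototype.

```python
WORD_BITS = 16                      # assumption: steering value stored in 16 bits
WORD_MASK = (1 << WORD_BITS) - 1


def flip_bit(value: int, bit: int) -> int:
    """Model a single bit-flip fault in a stored word."""
    return (value ^ (1 << bit)) & WORD_MASK


def store(value: int) -> tuple:
    """Store the value together with its bitwise complement."""
    return value & WORD_MASK, ~value & WORD_MASK


def is_consistent(word: int, inverted: int) -> bool:
    """A fault in either copy breaks the complement relation."""
    return (word ^ inverted) == WORD_MASK


if __name__ == "__main__":
    word, inverted = store(1200)    # hypothetical steering value
    faulty = flip_bit(word, 14)     # fault: bit 14 flips in the stored word
    print(f"stored {word}, after fault {faulty}")
    # Without a check, the faulty value would be sent on and the fault becomes an error.
    print("fault detected:", not is_consistent(faulty, inverted))
```

A single flipped high-order bit changes the value drastically, which is why an undetected fault in such a cell matters for steering.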


A fault can either be latent or active [12]. When latent, the fault does not directly alter the
system’s behaviour in such a way that it causes an error. A latent fault may, however, in combination
with another fault, propagate to an error (see multiple-point faults, section 3.6). When a fault is
active, the system is altered from its correct behaviour and an error is created. If the system
cannot handle the error, i.e., the fault is uncovered, a system failure will occur. The system failure
either leads to a fail-safe state or is a critical failure, which is interpreted as a hazard. In some
cases the fail-safe state can be achieved by turning off the system. When a system failure occurs
in a subsystem, the failure may propagate to higher system levels and cause faults there. This may
result in failure of the whole system [12].

Figure 3.1 shows a flowchart of how a system is exposed to a fault, how the fault can propagate
to an error and how this propagation can result in a system failure. The two kinds of system
failure, fail safe and critical failure, are further elaborated in section 3.3.


Figure 3.1: Flowchart showing how a fault propagates to an error and leads to a failure in a
system.

3.3 Degree of operation

In fault tolerance, four degrees are used to describe a system’s ability to
operate after a fault [13]. These degrees are:

• Fail operational - System is still operational after a fault.

• Degraded operational - System is operational but with degraded functionality after a fault.
Degraded operation refers to an acceptable operation that deviates from the correct system
behaviour. An example of degraded operation is losing control of one brake actuator in
a car’s brake system. Since three out of four brake actuators are still functional, the brake
system is in degraded mode.

• Fail safe - System is in safe operation after a fault. Safe operation is when the system goes to
a safe operational state in order to prevent critical failure. Safe operation may for example
be fail silent. Fail silent is when the system does not send output data in case of failure,
which in some cases is a better alternative than sending faulty data.

• Critical failure - System is in critical operation after a fault. A critical failure arises when a fault
propagates through the system and results in a failure that makes the system
deviate from its correct behaviour so that it may result in a catastrophic event. An example
of a critical failure is losing the steering functionality in a vehicle.

3.4 Integrity level

Nuclear plants, airplanes, vehicles and toasters are all systems where safe operation is of
importance. Safety involves the risk of potentially hazardous events, how the system is used, and
the severity of the outcome if a hazard occurs. Since different systems need different safety
requirements, the concept of integrity level arises. The safety integrity level (SIL) is based on
the probability that a system performs its intended functions within a period of time [9].

The SIL determines the required failure rate of the given system. The international standard
concerning E/E and programmable electronic devices is known as IEC 1508 [14][9]. The failure
rates for the continuous mode of operation, according to the standard, are stated in Table 3.1.

Table 3.1: Required failure rate for each Safety Integrity Level (SIL) according to the
standard IEC 1508 [9].

SIL   Failures per year
4     ≥ 10^-5 to < 10^-4
3     ≥ 10^-4 to < 10^-3
2     ≥ 10^-3 to < 10^-2
1     ≥ 10^-2 to < 10^-1
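
The bands in Table 3.1 can be read as a simple lookup. The following Python fragment is a minimal sketch of that lookup; the names `SIL_BANDS` and `sil_for_failure_rate` are hypothetical and chosen only for illustration, and rates outside the tabulated bands are reported as `None` because the table does not classify them.

```python
from typing import Optional

# (SIL, lower bound inclusive, upper bound exclusive), in failures per year,
# copied from Table 3.1.
SIL_BANDS = [
    (4, 1e-5, 1e-4),
    (3, 1e-4, 1e-3),
    (2, 1e-3, 1e-2),
    (1, 1e-2, 1e-1),
]


def sil_for_failure_rate(failures_per_year: float) -> Optional[int]:
    """Return the SIL whose band contains the rate, or None if no band does."""
    for sil, low, high in SIL_BANDS:
        if low <= failures_per_year < high:
            return sil
    return None
```

For example, a rate of 5 * 10^-5 failures per year falls in the SIL 4 band, while a rate of exactly 10^-4 belongs to SIL 3 because the lower bound of each band is inclusive.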

With the introduction of the standard ISO 26262 [4], a safety integrity level for automotive
vehicles has been defined. The Automotive Safety Integrity Level (ASIL) is assessed from the
hazardous outcome related to the failure of a system. Each hazardous event is assessed based on
the severity of the injuries, the amount of time the system is exposed to the possibility of the
hazardous event occurring, and the controllability, i.e., the driver’s ability to act to prevent
injuries. The hazardous event is then assigned an ASIL according to the ISO 26262 standard.
The classes are A, B, C, D and QM. ASIL D is assigned to the most severe hazards and ASIL A to
the least severe. Quality Management (QM) is assigned where there is no safety requirement.
The ASIL is further explained in section 6 of this thesis.

3.5 High Reliability

This section explains the term high reliability as it is used in this thesis. A system failure is
considered improbable if the failure rate is lower than 10^-9 failures/hour [13][9].

This thesis takes high reliability to correspond to the lowest reasonable failure rate for
a simplex system. Calculating the reliability of a simplex system over one year with an
exponential failure model and this failure rate gives a reliability of 99.999% (five nines). This
is considered to be a reasonable value for a high reliability system.
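
The five-nines figure can be checked with the exponential reliability model R(t) = exp(-λt). The sketch below assumes a constant failure rate λ = 10^-9 failures/hour and a year of 8760 hours (leap years ignored); the function and constant names are illustrative only.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8760 h, leap years ignored (assumption)


def reliability(failure_rate_per_hour: float, hours: float) -> float:
    """Exponential reliability model: R(t) = exp(-lambda * t)."""
    return math.exp(-failure_rate_per_hour * hours)


# Simplex system with the failure rate quoted in the text.
r_one_year = reliability(1e-9, HOURS_PER_YEAR)
print(f"Reliability over one year: {r_one_year:.7f}")
```

The exponent is -8.76 * 10^-6, so the reliability is roughly 0.9999912, which is just above the five-nines threshold of 0.99999.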

3.6 Single-point and multiple-point faults

A single-point fault is the kind of fault that violates the safety goal of the system according to
part 1 of ISO 26262 [4]. Therefore it is of importance to cover as many single-point faults as
possible with safety mechanisms in order to prevent them from propagating into a failure. An
example of a single-point fault in Figure 2.2 is if the communication bus gets disconnected. If
the CCU cannot communicate with the ECUs, this will cause the whole system to fail.

A multiple-point fault is when the occurrence of multiple independent faults may violate the safety
goal, according to part 1 of ISO 26262 [4]. The safety goal is violated when the system enters
the state of critical failure. When a safety mechanism exists that is designed to prevent
a certain failure from violating the safety goal, this failure is classified as detected,
perceived or latent. If the failure is detected by the safety mechanism, it is classified as
detected. If the failure is not detected by the safety mechanism but perceived by the user, it
is classified as perceived. If the failure is neither detected nor perceived, it is classified as
latent. A latent fault does not directly cause the system to fail. However, in combination with
the failure of another subsystem, the latent fault may become active and cause the whole system
to fail, i.e., a multiple-point fault. ISO 26262 states that combinations of more than two
independent faults need not be considered as multiple-point faults unless they are shown to be
relevant.
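
The detected/perceived/latent classification is a short decision chain. A minimal sketch in Python, where the function name and the two boolean inputs are hypothetical simplifications of what a real safety analysis would record:

```python
def classify_failure(detected_by_mechanism: bool, perceived_by_user: bool) -> str:
    """Classify a failure covered by a safety mechanism.

    Detection by the safety mechanism takes precedence over perception
    by the user; a failure that is neither is latent.
    """
    if detected_by_mechanism:
        return "detected"
    if perceived_by_user:
        return "perceived"
    return "latent"
```

Note that a failure which is both detected and perceived is reported as detected, following the order in which the text introduces the classes.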

3.7 Techniques for fault tolerance

To make a system safer and more reliable, safety mechanisms have to be added so that the system
can tolerate certain faults and is prevented from propagating to a critical failure. This
section describes different techniques for increasing the reliability of a system. Fault tolerance
can be implemented in both hardware (HW) and software (SW). HW-implemented fault tolerance
can be achieved by using duplex-modular redundancy or a lock-step CPU to detect a fault [9]. The
combination of triple-modular redundancy and a hardware voter can be used to prevent
a failure. Fault tolerance in SW can be achieved by implementing temporal
redundancy, information redundancy, voting and/or forecasting [9].

Temporal redundancy means that information is processed multiple times, at different points in
time, and that voting is used to ensure that no disturbance has corrupted the result. This adds
protection against transient faults such as bit flips in a register.

Information redundancy means storing several copies of the information so that an alteration of
one copy can be detected. A voting algorithm, e.g. majority voting, is used to decide which value
is the correct one.

The use of duplex-modular redundancy in HW means providing a backup system to increase
safety. The backup system can work in hot-stand-by (backup system working in parallel),
warm-stand-by (backup system receiving check-points) or cold-stand-by (backup system starting
when the primary unit fails). Triple-modular redundancy provides two backup systems.
If one of the systems fails, the other two can still work as a duplex-modular redundant
system. Adding a redundant backup system increases the overall safety. However,
introducing more components also increases the failure rate. An analysis has to be made to
ensure that the added failure rate does not outweigh the gain in overall safety. If it does,
there is no reason to add a redundant system in the first place.
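
The trade-off between redundancy and added components can be illustrated with the standard textbook reliability formulas for redundant structures. The sketch below is not the analysis performed in this thesis; it assumes identical units that fail independently, a perfect voter and perfect switch-over, and the function names are illustrative.

```python
def duplex_reliability(r: float) -> float:
    """Parallel pair: the system works while at least one of two units works."""
    return 1.0 - (1.0 - r) ** 2


def tmr_reliability(r: float) -> float:
    """Triple-modular redundancy with majority voting: 2 of 3 units must work."""
    return 3.0 * r**2 - 2.0 * r**3


for unit_reliability in (0.9, 0.4):
    print(unit_reliability, duplex_reliability(unit_reliability),
          tmr_reliability(unit_reliability))
```

Under these assumptions, triple-modular redundancy only improves on a single unit when the unit reliability is above 0.5. With unreliable units the redundant structure is worse than the simplex system, which is the situation the analysis is meant to rule out.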

A lock-step CPU is a fault-detection technique used to increase the safety of a system. Lock
step uses several identical CPUs and compares the outcome of each CPU. If the outcome of one
CPU differs from the other CPUs, this CPU is treated as faulty. If the outcomes differ in
a lock-step system with only two CPUs, it is still possible to say that a failure is detected, but
not which of the two CPUs is incorrect. This technique can be used to prevent a fault from
propagating to a failure. For example, if a fault such as a bit flip in a memory cell or register
occurs in one of the CPUs, the fault is detected by the lock-step technique. Since the fault is
detected, the system’s safety mechanism can prevent system failure.
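
The comparison step of lock step can be imitated in software. The following sketch models each CPU as a Python function, which is an assumption made purely for illustration; real lock-step CPUs compare results in hardware, cycle by cycle. All names are hypothetical.

```python
from typing import Callable, Sequence


class LockstepMismatch(Exception):
    """Raised when the replicated computations disagree."""


def run_in_lockstep(cores: Sequence[Callable[[int], int]], value: int) -> int:
    """Feed the same input to every core and compare the outputs."""
    outputs = [core(value) for core in cores]
    if len(set(outputs)) != 1:
        raise LockstepMismatch(f"outputs differ: {outputs}")
    return outputs[0]


def healthy_core(x: int) -> int:
    return 2 * x


def faulty_core(x: int) -> int:
    # Simulated bit flip in the least significant bit of the result.
    return (2 * x) ^ 1
```

With two cores, a mismatch only shows that a fault occurred, not which core is wrong, so the sketch raises an exception and leaves the reaction (for example entering a safe state) to the caller.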

Voting in software corresponds to reading information multiple times or from different sources
(providing the same information). The information is then analyzed using a voting algorithm
to ensure that the correct, or most probable, value is returned. When using majority voting to
retrieve the correct value of an input, a fault can be detected if the majority of the values
differ from each other. Failure prevention is possible only if a majority of the values are the
same. For example, if two out of three values are the same, failure prevention is established.
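
A majority voter of this kind can be sketched as follows. The function name and return convention are assumptions for illustration: the sketch returns the winning value, or `None` when no strict majority exists, together with a flag that is set whenever the inputs disagree at all.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence, Tuple


def majority_vote(values: Sequence[Hashable]) -> Tuple[Optional[Hashable], bool]:
    """Return (winner, fault_detected) for a set of redundant readings.

    winner is the value held by a strict majority of the inputs, or None
    when no such value exists. fault_detected is True when the inputs are
    not all identical.
    """
    if not values:
        raise ValueError("need at least one value")
    value, count = Counter(values).most_common(1)[0]
    fault_detected = count != len(values)
    winner = value if count > len(values) // 2 else None
    return winner, fault_detected
```

With three readings `[7, 7, 9]` the voter masks the deviating value and still reports the disagreement; with `[1, 2, 3]` it can only report that a fault exists.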

In addition to the triple-modular redundant system, an external hardware voter can be used.
The output from the three modules can be compared and the vote can occur instantly. A
hardware voter adds to the system cost and complexity which may increase the total fail rate of
the system.

Forecasting refers to predicting an output when the input information differs. This is
particularly useful for checking the correctness of a value for fault detection. Another use is to
predict a correct value to avoid critical failures. This can be better than potentially sending
bad information which can propagate to catastrophic events in the whole system.
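
The text does not prescribe a particular forecasting model. As one possible illustration, the sketch below assumes linear extrapolation from the two most recent samples and a fixed tolerance; both choices, and all names, are assumptions.

```python
def forecast(previous: float, latest: float) -> float:
    """Predict the next sample by linear extrapolation of the last two."""
    return latest + (latest - previous)


def checked_reading(reading: float, previous: float, latest: float,
                    tolerance: float) -> float:
    """Accept a plausible reading, otherwise substitute the forecast.

    A reading is treated as plausible when it lies within `tolerance`
    of the extrapolated value.
    """
    predicted = forecast(previous, latest)
    if abs(reading - predicted) <= tolerance:
        return reading
    return predicted
```

Here the forecast serves both purposes mentioned above: the comparison acts as fault detection, and the substituted value avoids forwarding an implausible reading.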

3.8 Fault injection

Fault injection is a method for testing and verifying the robustness of a system, and to verify the
effectiveness of the fault tolerant mechanisms. By introducing faults such as bit flips in registers
or memory, one can analyse whether the system detects the fault or not. If a fault is detected,
the system should be able to either prevent it from propagating or continue to a safe state.
There exist mainly four types of fault injection techniques to test and verify the robustness of
systems. These are software-implemented fault injection, physical-implemented fault injection,
radiation-implemented and simulation-based fault injection [15].

In physical-implemented fault injection, two common techniques are pin-level fault injection and
test port-based fault injection. Pin-level fault injection can be done by using a probe to force
different signals to high or low on circuit level. In test port-based fault injection, test access ports
are used to interact with the integrated circuit to inject bit flips into memory and registers.

Software-implemented fault injection can be performed either at runtime or before runtime. Before
runtime, the instructions or predefined values are altered. Runtime software-implemented fault
injection requires additional software which in turn injects faults into the system.
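
A very small software-implemented fault injection campaign can be sketched in Python. The example injects every single-bit flip into a stored word and counts how many are caught by a parity check. The parity check stands in for the mechanism under test and is an assumption made for illustration, as are all names.

```python
def flip_bit(word: int, bit: int) -> int:
    """Inject a fault by inverting one bit of the word."""
    return word ^ (1 << bit)


def parity_bit(word: int) -> int:
    """Return 1 if the word has an odd number of set bits, else 0."""
    return bin(word).count("1") % 2


def run_campaign(word: int, width: int = 8) -> int:
    """Inject each single-bit flip once and count the detected faults."""
    stored_parity = parity_bit(word)
    detected = 0
    for bit in range(width):
        corrupted = flip_bit(word, bit)
        if parity_bit(corrupted) != stored_parity:
            detected += 1
    return detected
```

Since a single bit flip always changes the parity, every injected fault in this campaign is detected. A double bit flip would go unnoticed, which is the kind of coverage gap a fault injection campaign is meant to reveal.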

Radiation-implemented fault injection can be done by exposing the system to electromagnetic
interference or heavy-ion radiation.

Simulation-based fault injection is a technique where a model of the original system is made and
tested in a virtual environment. For example, when testing a power plant it is not wise to put the
reactor in a potentially critical state. Instead simulation-based testing is used,
where the system’s environment is simulated.

3.9 Architecture designs

This section explains different architectures for a drive-by-wire system described in "On dis-
tributed control-by-wire system for critical applications" by Roger Johansson [13]. The three
architectures explained are the local, centralized and distributed architectures.

3.9.1 Local

A local architecture is shown in Figure 3.2. It is the first generation of architectures and is
simple to understand, design and implement. Each subsystem contains its own sensor
and electronic control unit (ECU) which is connected to the actuator. The disadvantage of
the local architecture is that sensors may need to be duplicated when information is needed in
different controls, which is generally costly and not practical. From a safety perspective, the
usage of a local architecture in a drive-by-wire system is unsafe since every subsystem is a single
point of failure. This means that if any one of the components shown in Figure 3.2 fails,
the drive-by-wire system as a whole fails.

Figure 3.2: Local architecture: Each subsystem consists of one sensor and one electronic control
unit (ECU).

3.9.2 Centralized

A centralized system is based on one common control unit that gathers all the sensor information
and distributes it to the actuators. It is the second generation of architectures and is suitable
for systems that have a natural safe state (e.g. the system automatically shuts down when a failure
is detected). The advantage of the centralized architecture is that information from all sensors
is available for processing and distribution to the actuators. When values from different sensors
are available to a control unit, they can be combined to form a more general picture of the
vehicle’s condition. For example, if the user applies both brake and throttle, there is no reason
for the actuators to act on both inputs since this can damage the vehicle. A centralized
architecture for a drive-by-wire system is shown in Figure 3.3. From a safety perspective it is not
recommended for safety-critical applications, since the architecture contains several single points
of failure (e.g. control unit and communication bus). With additional software and hardware,
it may be suitable, but only if enough redundancy is added for the safety criteria to be met.

Figure 3.3: Centralized architecture: The system consists of one central control unit (CCU) for
all the sensors and electronic control units (ECUs).

3.9.3 Distributed

The distributed architecture is the third generation of architectures and is shown in Figure
3.4. It is the most highly recommended architecture for safety-critical systems according to "On
distributed control-by-wire system for critical applications" by Roger Johansson [13].

Compared to the centralized architecture, the distributed architecture divides functions into sub
functions. This gives rise to several redundancies and increases the safety of a vehicle. One
example is distribution in the brake system. The brakes are distributed over two electronic control
units (ECUs): one ECU controls the brake actuators on the front right wheel and back left
wheel (brake system 1), and the other ECU controls the brake actuators on the front left wheel
and back right wheel (brake system 2). If brake system 1 fails, the vehicle still has degraded
functionality in the brakes. When brakes are distributed to all four wheels, degraded steering can
be obtained when the main steering system has failed. This gives additional redundancy to the
drive-by-wire system in both the brakes and the steering. The concept of steer-by-brakes is further
elaborated in section 4.4.2.
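
The diagonal split can be written down as a small lookup. In the Python sketch below only the pairing of wheels follows the text; the identifiers and the function name are hypothetical labels chosen for illustration.

```python
# Diagonal split as described in the text; the identifiers are illustrative.
BRAKE_SYSTEMS = {
    "brake_system_1": {"front_right", "back_left"},
    "brake_system_2": {"front_left", "back_right"},
}


def braked_wheels(failed_systems: set) -> set:
    """Return the wheels that can still be braked after the given systems fail."""
    wheels = set()
    for name, controlled in BRAKE_SYSTEMS.items():
        if name not in failed_systems:
            wheels |= controlled
    return wheels
```

If brake system 1 fails, the front left and back right wheels remain, so the vehicle keeps one braked wheel on each side and on each axle. This corresponds to the degraded brake functionality described above.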

Figure 3.4: Distributed architecture: The brake ECU is distributed to control the brakes on each
side of the vehicle.

4 Concept phase

Before implementing a system, this thesis follows the guidelines of the concept phase described
in part 3 of the ISO 26262 standard [4]. Figure 4.1 shows an overview of the safety life cycle of
the ISO 26262 standard. The phases marked with a dashed frame are not included in this report. This
section deals with the concept phase of the safety life cycle.

First, an item definition is made where the reference system is described at functional level.
Then the initiation of the safety life cycle is analysed, to determine whether a new development
should be initiated or the reference system should be modified. A hazard analysis and
risk assessment is then made of the reference system to determine the hazardous events. Each
hazardous event is then assigned an ASIL according to ISO 26262. Lastly, a functional
safety concept is made to extract the safety goal of the developed system.

Figure 4.1: Overview of the safety life cycle for the concept phase. Phases marked with a dashed
frame are not included in the thesis.

4.1 Item definition

An item is one or several systems that implement a function at vehicle level. An element refers
to electrical and electronic (E/E) components of the system or other technologies (see footnote 1).
An item is built of one or several elements. The items that constitute the drive-by-wire system
are: steer-by-wire, brake-by-wire and throttle-by-wire. These items and their functions are further
described in Table 4.1. The functionality of the drive-by-wire system is considered failed if any
of these items fails.

Footnote 1: Other technologies refers to technologies that are out of the scope of the ISO 26262
standard, such as hydraulic or mechanical technologies [4].

Table 4.1: The items of the drive-by-wire system and description of their functionality.

Item              Function       Description of the intended function
Steer-by-wire     Steer control  Control the steering of the vehicle
Throttle-by-wire  Speed control  Increase the speed of the vehicle
Brake-by-wire     Speed control  Decrease the speed of the vehicle

4.1.1 Functional concept

This section provides information about the functional and non-functional requirements (see
footnote 2) of the different items. The purpose of the functionality is explained with a use case
and the different operating modes are stated for the functions. The interaction between the
elements of the items shown in Table 4.1 is explained, including the elements based on other
technologies.

To understand the user’s interaction with the system, a use case is developed. Figure 4.2 shows
how the user interacts with the different functions of the drive-by-wire system. Since the user
needs to be able to control the steering angle and speed regulation of the vehicle, the main
functions: steer control and speed control, are defined. The steer control needs to be able to steer
the vehicle right and left using the front wheels. The speed control needs to be able to increase
and decrease the speed of the vehicle. For speed interaction the vehicle consists of a throttle and
a brake pedal and for steering-angle interaction the vehicle has a steering wheel.

Figure 4.2: Use case of the items showing the different functions and how the user interacts with
the functions.

The drive-by-wire system has to consist of the following elements to be able to provide the
functionality described in Figure 4.2.

• Central control unit (CCU).

• Brake and throttle electronic control unit (ECU).

• Steer ECU.

• Steering wheel and pedals including sensors.

• Communication bus.

• Brake, throttle and steering actuators.

Footnote 2: A non-functional requirement refers to criteria used to judge the operation of an item,
rather than the behaviour [4].

The main purpose of the CCU is to convert the output values of the steering sensor, braking
sensor and throttle sensor into recognizable values and transfer them using the communication
bus. Each ECU fetches the data of interest to it from the communication bus. The ECU then
outputs the value to its corresponding actuator, which controls the intended functionality. The
interaction between the different elements is shown in Figure 4.3.

Figure 4.3: The interaction between the elements of the items: Steer-by-wire, Brake-by-wire and
Throttle-by-wire.

The functional concept is limited to the functions of the drive-by-wire system. It should
be mentioned that the prototype of the drive-by-wire system does not include vital elements such
as the power supply of the vehicle, real-sized actuators, a full-sized engine or realistic sensors
for steering wheel and pedals. These elements have to be considered in a real
implementation of the drive-by-wire system.

4.1.2 Usage in different environments

An analysis of how environmental factors contribute to the occurrence of hazardous
events has to be made. To do so, different environments have to be specified and
evaluated regarding the probability of exposure. The environments that the drive-by-wire
system is considered to be frequently exposed to are explained in Table 4.2. The frequency is
described in the terms often or seldom according to part 3 of the ISO 26262 standard [4]. Often
means that the situation occurs during almost every drive on average. Seldom means that the
situation occurs a few times a year for the great majority of drivers. The impact refers to the
impact the environment has on the drive-by-wire system. Low refers to no or very small impact and
high refers to significant impact on the system. The impact on the drive-by-wire system is high
only in manoeuvre situations. Manoeuvre situations are therefore the environment considered when
evaluating the risk in section 4.2.

Table 4.2: Explanation of the environments the drive-by-wire system is considered to be exposed
to.

Environment           Examples of situation                                  Frequency  Impact
Road layout           Highway, secondary road, country road.                 Often      Low
Road layout           Highway exit ramp, intersection.                       Seldom     Low
Road surface          Dry or wet road, asphalt.                              Often      Low
Road surface          Packed snow and ice, slippery leaves on road, gravel.  Seldom     Low
Manoeuvre situations  Accelerating, decelerating, steering vehicle.          Often      High
Manoeuvre situations  Driving in reverse, parking.                           Seldom     High

4.2 Hazard analysis and risk assessment

Based on the item definition, a hazard analysis and risk assessment (HARA) is made and safety
goals are established. The HARA determines the potential hazards of an item so that safety goals
can be determined. First, the classification of the impact factors (severity, probability of
exposure and controllability) is elaborated for the item. From these factors an ASIL of the
hazardous events can be determined. Then a safety goal can be assigned for the item according to
ISO 26262 [4].

Severity is the classification of the injuries caused by the hazardous event on the driver, other
road users or pedestrians. Severity has the classifications S0-S3 where class S0 refers to no injuries
and S3 refers to life-threatening injuries caused by the hazardous event. Table 4.3 explains how
different classifications are interpreted according to the ISO 26262 standard [4].

Table 4.3: Description of the injury impact for the different severity classification S0-S3.

Classification  S0    S1     S2      S3
Description     None  Light  Severe  Fatal

Exposure is the probability of exposure for the system or subsystem that can cause the hazardous
event. For example, if a system is rarely used, i.e., has a low exposure, the probability of the
hazardous event occurring in that system is low. Exposure has the classifications E0-E4, where E0
is a system with an incredibly low exposure and E4 a system with a high probability of exposure.
E0 could for example be a system that is never used (e.g. a system reserved for future use); the
exposure of that system is then defined as incredible. An exposure of E4 means that the system is
used during almost every drive on average, e.g. systems for steering, throttle or brake. Table 4.4
shows how the different classifications are interpreted according to the ISO 26262 standard [4].

Table 4.4: Description of the probability of exposure of the systems that can cause the hazardous
event. The exposure is classified in E0-E4 and determination is based on operational situation
of the target.

Classification  E0          E1           E2                E3            E4
Description     Incredible  Once a year  Few times a year  Once a month  Every drive

Classification of the controllability is the plausibility that the driver, or another person at
risk, can gain sufficient control to avoid the hazardous event. Controllability is classified
as C0-C3 according to Table 4.5. A controllability classification of C0 applies, for example, when
the system causing the potential hazardous event goes into a safe state, e.g. some driver
assistance systems can simply be turned off. A controllability classification of C3 applies when
the driver cannot avoid the hazardous event due to the system failure, e.g. the brake system
stops working. Assumptions about the controllability are made according to ISO 26262 part 3 [4].
It is stated that the driver is considered to be in appropriate condition (e.g. not influenced by
drugs or tiredness), to have proper driver training, normal physical health and reactions, and to
comply with legal regulations. Controllability also includes the actions that pedestrians (or
other road users) can take to avoid the hazardous event.

Table 4.5: Description of the controllability of the drivers or other persons at risk to avoid the
hazardous event. The controllability is classified in C0-C3.

Classification  C0                       C1      C2        C3
Description     Controllable in general  Simply  Normally  Difficult or uncontrollable

4.2.1 Hazard identification and classification

This section identifies the hazardous events that can occur in case the item fails. To decide the
ASIL of the different hazards that may occur at the functional level of a system, a function and
failure analysis (FFA) is made.

The FFA is shown in Table 4.6. It lists the basic functions of the item together with different
classes of failure, i.e., omission, commission and stuck-at value. Omission means that the function
does not return a value, i.e., fail silent. Commission means that the function returns random
values, sometimes referred to as babbling idiot. Stuck-at value means that the function returns a
constant value. For each function, worst case environments and worst case failures are determined
and the classifications of the hazardous events are set. The worst case environments are either
slow driving in a dense traffic area or high speed traffic such as a highway. The classifications
are severity (S), probability of exposure (E) and controllability (C). From the classifications
the ASIL is extracted using Table 4.7.

Table 4.6: Hazard identification and ASIL determination based on the class of failures of the
worst case environments and failures.

Function    Class of failure  Worst case environment / failure            Classification  ASIL
Accelerate  Omission          Railroad crossing / No acceleration         S3,E1,C2        QM
Accelerate  Omission          Highway / No acceleration                   S1,E4,C1        QM
Accelerate  Commission        City (traffic light) / Sudden acceleration  S3,E4,C3        D
Accelerate  Stuck at value    City / Constant acceleration                S3,E4,C3        D
Brake       Omission          City / Loss of brake                        S3,E4,C3        D
Brake       Commission        Highway / Sudden brake                      S3,E4,C3        D
Brake       Stuck at value    Highway / Constant retardation              S2,E4,C2        B
Steering    Omission          Highway / Loss of steering                  S3,E4,C3        D
Steering    Commission        Highway / Sudden steering angle             S3,E4,C3        D
Steering    Stuck at value    Highway / Constant steering angle           S3,E4,C3        D

Table 4.7: ASIL determination according to part 3 of the ISO 26262 standard [4].

Severity  Exposure  C1  C2  C3
S1        E1        QM  QM  QM
S1        E2        QM  QM  QM
S1        E3        QM  QM  A
S1        E4        QM  A   B
S2        E1        QM  QM  QM
S2        E2        QM  QM  A
S2        E3        QM  A   B
S2        E4        A   B   C
S3        E1        QM  QM  A
S3        E2        QM  A   B
S3        E3        A   B   C
S3        E4        B   C   D
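
The ASIL determination is a table lookup on severity, exposure and controllability. The Python sketch below encodes Table 4.7 directly; the names `ASIL_TABLE`, `CONTROLLABILITY_COLUMNS` and `determine_asil` are hypothetical. Classes that do not appear in the table (S0, E0, C0) are not handled and raise an error.

```python
# Rows of Table 4.7: (severity, exposure) -> ASIL for C1, C2, C3.
ASIL_TABLE = {
    ("S1", "E1"): ("QM", "QM", "QM"),
    ("S1", "E2"): ("QM", "QM", "QM"),
    ("S1", "E3"): ("QM", "QM", "A"),
    ("S1", "E4"): ("QM", "A", "B"),
    ("S2", "E1"): ("QM", "QM", "QM"),
    ("S2", "E2"): ("QM", "QM", "A"),
    ("S2", "E3"): ("QM", "A", "B"),
    ("S2", "E4"): ("A", "B", "C"),
    ("S3", "E1"): ("QM", "QM", "A"),
    ("S3", "E2"): ("QM", "A", "B"),
    ("S3", "E3"): ("A", "B", "C"),
    ("S3", "E4"): ("B", "C", "D"),
}
CONTROLLABILITY_COLUMNS = ("C1", "C2", "C3")


def determine_asil(severity: str, exposure: str, controllability: str) -> str:
    """Look up the ASIL for one hazardous event classification."""
    row = ASIL_TABLE[(severity, exposure)]
    return row[CONTROLLABILITY_COLUMNS.index(controllability)]
```

Applied to the classifications in Table 4.6, the lookup reproduces the listed levels, for example QM for S3,E1,C2 and B for S2,E4,C2.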

4.2.2 Safety goal

For each hazardous event with corresponding ASIL, a safety goal is defined in this section. The
hazardous events with their corresponding ASIL are shown in Table 4.6. If information to the
actuators cannot be provided correctly, the FFA indicates which default message should be sent
in case of system failure for the different functions. Accelerate has the least severity in
omission. Brake and steering have the least severity when stuck at value. The brake function
should be able to override the throttle function. When accelerate has a commission or
stuck-at-value failure, the driver should be able to brake to avoid injury.
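
The brake-override-throttle behaviour can be sketched as a small arbitration function. The example assumes pedal positions normalized to the range 0 to 1 and a small threshold below which the brake pedal counts as released; the threshold value and all names are assumptions, not taken from the prototype.

```python
from typing import Tuple


def arbitrate(throttle: float, brake: float,
              brake_threshold: float = 0.05) -> Tuple[float, float]:
    """Return (throttle_command, brake_command) with the brake taking priority.

    Whenever the brake request exceeds the threshold, the throttle
    command is forced to zero, regardless of the throttle request.
    """
    if brake > brake_threshold:
        return 0.0, brake
    return throttle, brake
```

With this rule a throttle request that is stuck at full value, `arbitrate(1.0, 0.5)`, still results in a zero throttle command as soon as the driver brakes.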

When the functionality is in degraded mode, i.e., one wheel unit has failed or steering has
failed, the driver should be notified to stop the vehicle. This lowers the risk
of another failure occurring, compared to letting the driver continue driving a vehicle with
degraded functionality. The fault tolerant time interval should give the driver sufficient time
to slow down and pull over to the side of the road. The fault tolerant time interval is the time
between the occurrence of a fault and a possible hazard [4].

The safety goal is to ensure that the system will not fail so that it causes a hazard. Therefore
these safety goals have been established for the item:

• The brake should be able to override throttle function.

• The brake functionality shall not fail.

• Driver should be warned when at degraded functionality.

• The throttle system needs to be fail silent in case of failure.

• The steer system needs to be stuck at value in case of failure, enabling steer-by-brakes
functionality.

When it is stated that a functionality shall not fail, the system needs to fulfil the highest
possible reliability for that functionality.

4.3 Functional safety concept

After the safety goal is determined for each item, detailed safety requirements are derived to
achieve the safety goals. The safety requirements inherit the ASIL corresponding to the safety
goal. Safety goals can be achieved by combining different functions to put the driver in a safe
state. For example, a failure that causes full throttle can be overridden by the brakes if the
brakes are given higher priority than the throttle. Based on the safety goals explained in section
4.2.2 and the ASIL from Table 4.6 in section 4.2.1, the safety goals can be achieved by:

• Using a safe state (e.g. switching off the system causing the hazardous event in case of
failure).

• Considering the fault tolerant time interval (e.g. a warning lamp telling the driver to pull over
to the side of the road). This reduces the exposure time in a critical event, where a hazard may or
may not occur.

• Including maximum values for unwanted changes in the system, for example a speed limit in
degraded mode.

The safety requirements are determined for the different systems of the functions. The differ-
ent systems are then connected to the different elements explained in Figure 4.4. The safety
requirements that are supposed to be implemented in E/E are:

• Brake-overriding-throttle function in the central control unit (CCU).

• Distributed brake system in the CCU and distributed brake ECUs.

• Redundancies in both hardware and software for the CCU.

• Steer-by-brakes compatibility for the brake ECU.

• Setting steering at forward position in degraded mode to enable steer-by-brakes.

• Notification to the driver, in case of failure, to inform that the vehicle is in degraded mode.

Figure 4.4: Safety requirements with their corresponding element, determined from the safety
goals.

The design decisions drawn from the safety goals are explained in section 5. The functional safety
is further elaborated in section 6.

4.4 Modifications to the reference design

The reference drive-by-wire system with no fault tolerance, explained in section 2, is modified to
obtain a system that is suitable for fault tolerance. This section explains the modifications
made to the reference drive-by-wire system and the impact of these modifications.

4.4.1 Distributed brake system

The reference system contains only one brake ECU. A distributed architecture gives higher
reliability, since the distributed brakes are independent of each other. The drive-by-wire
system is therefore modified to use a distributed brake system, in which two ECUs each control
two diagonally placed actuators.

4.4.2 Steer-by-brakes

When the brake system is distributed, a technique for emergency steering can be implemented.
Steer-by-brakes functionality [16][17] enables the driver to turn the car left and right, in a
very limited way, using the steering wheel and the distributed brakes. It serves as a backup
if the steering system suffers a critical failure.

Splitting the brake ECU into two ECUs that each control two actuators makes it possible to
control the brake force on each side of the vehicle separately. Applying brake force to both
wheels on one side of the car creates a difference in rotational speed between the two sides.
This difference generates a torque around the center of the car, which makes the car turn
slightly towards the side on which the brake force is applied. If the ECU responsible for the
steer actuator fails, it is therefore still possible to stop the car at the side of the road by
using the distributed brakes. This assumes that both brake ECUs are functional.
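
The side-selective braking described above can be illustrated with a minimal sketch. It is not the thesis implementation: the function name, the normalised steering request in [-1, 1] and the linear mapping to a brake demand are all assumptions made for illustration.

```python
def steer_by_brakes(steering_request, max_brake=1.0):
    """Map a steering request in [-1, 1] to (left, right) brake demands.

    Hypothetical convention: negative values request a left turn,
    positive values a right turn. Braking is applied only to the side
    the car should turn towards.
    """
    # Clamp the request to the valid range.
    request = max(-1.0, min(1.0, steering_request))
    demand = abs(request) * max_brake
    if request < 0:
        return (demand, 0.0)  # brake the left side: car turns left
    if request > 0:
        return (0.0, demand)  # brake the right side: car turns right
    return (0.0, 0.0)         # straight ahead: no steering brake force
```

A real controller would also have to combine this demand with the driver's normal brake request, which is left out here.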

Another way of using the distributed brakes to maintain steering capability when the steer ECU
fails is to apply brake force to one of the front wheels. According to [18], this generates a
yaw moment on the front axle that makes it possible to control the steering angle of the front
wheels. This does, however, require feedback of the steering angle, and it also requires that
both brake ECUs and all brake actuators are functional.

4.4.3 Impact analysis

Modifying the reference system with a distributed brake system and steer-by-brakes
functionality changes the functionality of the system. Both modifications are important for
increasing the safety of the drive-by-wire system. Figure 4.5 shows the modified use case
(compare with Figure 4.2) and Figure 4.6 shows the modified interaction diagram (compare with
Figure 4.3). The steer control can now control the vehicle both via traditional steering (i.e.,
controlling the angle of the front wheels) and via the distributed brakes (i.e.,
steer-by-brakes).

Figure 4.5: Use case of the modified items, showing how the user interacts with the modified
system. The dashed lines indicate that the steering functionality is a backup mode, used only
if the primary steering fails.

Figure 4.6: Interaction of the modified items with the different subsystems and their
corresponding elements.

The modifications enable the steering and brake subsystems to remain operational in a degraded
mode. The three modes of the distributed brake system, with respect to brake functionality,
are explained below.

• Full functionality - Both distributed brakes are operational.

• Degraded functionality - One of the two distributed brakes is functional.

• System failure - Both distributed brakes have failed.

The three different modes for the steering functionality are explained below:

• Full functionality - Traditional steering unit is operational.

• Degraded functionality - The traditional steering unit has failed, but both distributed
brakes are operational.

• System failure - The primary steering unit and at least one of the distributed brakes have
failed; steering capability can no longer be ensured.

The two functions complement each other. If either the brakes or the steering drop to degraded
functionality, the system can warn the driver that the vehicle is in degraded mode.
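
The two mode lists can be summarised as a small classification sketch. The names `Mode`, `brake_mode`, `steering_mode` and `driver_warning_needed` are hypothetical and only restate the bullet points above; they are not taken from the thesis software.

```python
from enum import Enum


class Mode(Enum):
    FULL = "full functionality"
    DEGRADED = "degraded functionality"
    FAILURE = "system failure"


def brake_mode(brake_ecu_ok):
    """Classify the brake function from the status of the two brake ECUs."""
    working = sum(1 for ok in brake_ecu_ok if ok)
    if working == 2:
        return Mode.FULL
    if working == 1:
        return Mode.DEGRADED
    return Mode.FAILURE


def steering_mode(steer_ok, brake_ecu_ok):
    """Classify the steering function; steer-by-brakes needs both brake ECUs."""
    if steer_ok:
        return Mode.FULL
    if all(brake_ecu_ok):
        return Mode.DEGRADED
    return Mode.FAILURE


def driver_warning_needed(steer_ok, brake_ecu_ok):
    """Warn the driver as soon as either function leaves full functionality."""
    return (brake_mode(brake_ecu_ok) is not Mode.FULL
            or steering_mode(steer_ok, brake_ecu_ok) is not Mode.FULL)
```

For example, `steering_mode(False, (True, False))` evaluates to `Mode.FAILURE`, which corresponds to the dependency between steering and brakes described in the last bullet.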

5 Product development

This section explains the redundancy strategies chosen for the development of the central
control units (CCUs). It follows the product development phase of ISO 26262, parts 4, 5 and 6
[4]. The safety life cycle of the product development is shown in Figure 5.1. The development
follows a top-down approach, starting from development at system level and branching down to
the implementation at hardware and software level.

Figure 5.1: Overview of the safety life cycle for the product development. Phases and subphases
marked with a dashed frame are not included in the thesis.

5.1 System design

The reference system uses a centralized architecture in which the central control unit (CCU)
controls the ECUs for brake, throttle and steering. The architecture of the reference design
is shown in Figure 5.2. Note that a single brake ECU controls all the actuators (marked with A
in the figure). Since this architecture is not well suited to introducing redundancy, a
distributed architecture, explained in section 4, is developed in this thesis. A system
overview of the distributed architecture is shown in Figure 5.3. The architecture uses separate
brake ECUs that each control one actuator on each side of the vehicle: brake ECU 1 controls the
actuators A1, and brake ECU 2 controls the actuators A2 (see Figure 5.3).

The weakest link of the architecture is the central control unit (CCU): it has the highest
complexity and is therefore considered to have the highest failure rate in the system. The
sensors and the CAN bus are outside the scope of this thesis and are therefore considered
ideal, in the sense that they are immune to faults.

Figure 5.2: Centralized architecture for the
drive-by-wire used in the reference design.

Figure 5.3: Distributed architecture used in the fault-tolerant design. Brake ECU 1 controls
the actuators A1 and Brake ECU 2 controls the actuators A2.

5.2 Hardware design

This section gives an overview of the hardware (HW) design of the product development. The
hardware design deals with the implementation of hardware parts to increase the reliability of
the fault-tolerant design.

5.2.1 Duplex-modular redundancy for central control unit

The hardware needs a failover mechanism to ensure safety if the CCU suffers a permanent error
that results in a failure. The system will run in duplex mode, with two CCUs providing the
same safety-critical functions for the drive-by-wire system. One CCU acts as the main (or
primary) CCU while the other runs as backup.

The two CCUs can work in hot-standby, warm-standby or cold-standby. Since the processed sensor
values for the ECUs are time critical, and CPU utilization is not an issue in the reference
design [5], hot-standby failover is chosen. The backup CCU runs in hot-standby mode so that it
can take over the functions of the primary CCU if the primary CCU fails.

5.2.2 Error detection

The development boards chosen for the CCUs already provide error detection through a lock-step
CPU. The lock-step CPU can detect both permanent hardware faults and temporary faults that the
software cannot handle, and the technique can be used to prevent a fault from propagating to a
failure. Lock-step and error detection are elaborated in section 3.7.

5.2.3 Reset handling

When a temporary error is detected in a central control unit (CCU), the CCU has to be restarted
to guarantee continued safe execution. If the primary CCU has a permanent error resulting in a
failure, the secondary CCU has to take over. Reset handling refers to the design of how the
active CCU is restarted when an error is detected. The active CCU can be either the primary or
the secondary CCU, depending on which one is currently executing.

This section presents two designs for restarting a CCU: the distributed-reset design, shown in
Figure 5.4, and the self-reset design, shown in Figure 5.5. Each CCU has a monitoring function
(hereinafter referred to as the monitor) that detects when an error occurs in the CCU. If an
error is detected, the monitor restarts the erroneous CCU using a power switch located on the
CCU. If the CCU has a permanent failure, i.e., the error still appears after a restart, the
secondary CCU takes over and the primary CCU is shut off.
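
The restart-then-shut-off policy of the monitor can be sketched as a small state machine. This is only an illustration under stated assumptions: the class name, the string actions and the `max_restarts` threshold are hypothetical, and the actual monitor in the thesis is a hardware design described in section 7.

```python
class Monitor:
    """Restart a CCU on error; shut it off if the error persists after a restart."""

    def __init__(self, max_restarts=1):
        self.max_restarts = max_restarts  # assumed threshold, not from the thesis
        self.restarts = 0
        self.shut_off = False

    def on_check(self, error_detected):
        """Return the action for one monitoring cycle."""
        if self.shut_off:
            return "off"
        if not error_detected:
            self.restarts = 0        # error cleared: it was temporary
            return "run"
        if self.restarts < self.max_restarts:
            self.restarts += 1
            return "restart"         # power-cycle the CCU
        self.shut_off = True         # error persists: treat it as permanent
        return "shut_off"            # the other CCU has to take over
```

Feeding the sequence no error, error, no error, error, error, error into `on_check` yields run, restart, run, restart, shut_off, off.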

In the distributed-reset design, shown in Figure 5.4, the monitor of each CCU is implemented in
the software (SW) of another CCU. The SW monitors enable the CCUs to reset each other when an
error occurs by controlling the reset signals of their neighbours. This can be implemented in
two ways: either every CCU is connected to the error pin and the reset signal of all the other
CCUs, or each CCU is connected to the error pin and the reset signal of its two closest CCUs,
as shown in Figure 5.4. In a system with several CCUs, this design demands a large number of
interconnects to enable reset handling between all CCUs.

In the self-reset design, shown in Figure 5.5, each CCU has an external hardware (HW) monitor.
Making a safe implementation of the HW monitor increases the cost and design complexity of the
system. The self-reset design is suggested for systems where more than two CCUs hold the
safety-critical functions. Its benefit is that it is easy to add further CCUs that run the
safety-critical functions.

Even though the distributed-reset design seems the cheaper alternative for this project, where
only two CCUs are used, the self-reset design is better from a safety perspective. Since the
monitor is able to permanently shut down a CCU, a failure in the distributed-reset monitor
implemented in one of the CCUs may cause the other CCU to shut down permanently. This results
in a system failure, since there is then no functional CCU. If the same error occurs in the
self-reset design, for example if the faulty monitor of the secondary CCU forces the secondary
CCU to shut down, the primary CCU is not affected, since the monitors are independent of each
other.

The monitor discussed in this section is not included in the CCU and is therefore implemented
as part of this thesis. The implementation of the self-reset monitor is further explained in
section 7.

Figure 5.4: Distributed-reset design for the
central control unit (CCU) where the monitor
is implemented in software (SW).

Figure 5.5: Self-reset design for the central
control unit (CCU) using a hardware (HW)
monitor for resets.

5.3 Software design

The next step is to design the software (SW) to ensure redundancy and safety in the CCU. The
SW design is implemented in the CCU to handle the drive-by-wire functions safely for the
driver. The system needs to include both redundancy in time (temporal redundancy) and
information redundancy. Table 5.1 presents methods for error detection at software level [4].
The recommendation for each method depends on the ASIL of the item in which it is implemented:
"o" means that there is no recommendation for or against the method for that ASIL, "+" means
that the method is recommended, and "++" means that it is highly recommended. The ISO 26262
standard recommends that one or several of the methods in Table 5.1 are used when developing
the software design.

Table 5.1: Mechanisms for error detection at software level. [4]

Method                                   ASIL A   ASIL B   ASIL C   ASIL D
Range checks of input and output data    ++       ++       ++       ++
Plausibility check                       +        +        +        ++
Detection of data errors                 +        +        +        +
External monitoring facility             o        +        +        ++
Control flow monitoring                  o        +        ++       ++
Diverse software design                  o        o        +        ++

The methods for range checks of input and output data and plausibility check were chosen to be
part of the software design. An overview of the software design is shown in Figure 5.6.

Figure 5.6: Software design of the safety mechanisms in the central control unit.

In block A the data is read from the sensors. A range check is made on each reading to ensure
that the value is in range. Reading the values several times provides temporal redundancy.
Separating the reads by a short time gap makes it possible to detect whether the data is
continuous: if the sensor cable is broken, for example, the data will most likely be affected
by high-frequency noise. If the data varies too much between reads, it is treated as faulty. A
plausibility check is made by comparing the values with each other using a median voter. If
the measured values differ too much from each other and no median value can be decided, the
system will restart or hand over to a backup system.
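
A minimal sketch of the checks in block A is given below. The function name, its parameters and the rule that a majority of the readings must pass each check are assumptions made for illustration; the thesis does not specify these details here.

```python
from statistics import median


def read_sensor_block_a(samples, low, high, max_spread):
    """Return a voted sensor value, or None if the readings are not plausible.

    samples    -- readings taken a short time apart (temporal redundancy)
    low, high  -- valid range for a single reading (range check)
    max_spread -- largest accepted deviation from the median (plausibility check)
    """
    # Range check: discard readings outside the valid interval.
    in_range = [s for s in samples if low <= s <= high]
    if len(in_range) * 2 <= len(samples):
        return None  # no majority of readings is in range
    voted = median(in_range)
    # Plausibility check: a majority of all readings must lie close to the median.
    close = [s for s in in_range if abs(s - voted) <= max_spread]
    if len(close) * 2 <= len(samples):
        return None  # readings disagree too much: restart or hand over
    return voted
```

A return value of `None` stands for the case where the system has to restart or hand over to the backup.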

Block B processes the correct sensor value provided by the reference block. It processes the
value several times and stores the results in multiple places to ensure information
redundancy. If a bit flip, caused e.g. by ionizing particles, occurs in one of the memory
cells where the sensor values are stored, it will most likely also affect memory cells located
physically close to it. It is therefore a good idea to physically separate the memory
locations of the stored sensor values, to minimize the risk that a single event alters several
of them.

In block C the multiple values stored in memory are read back, and majority voting is used to
ensure their correctness. If a majority of the values agree, the majority voter provides the
correct value and thereby prevents the fault from propagating. If no majority exists, the
majority voter signals that a fault has been detected, so that a system failure can be
prevented.
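
The voting step of block C can be sketched as follows. The function name and the use of `None` to signal a detected fault are assumptions for illustration.

```python
from collections import Counter


def majority_vote(copies):
    """Return the value held by a strict majority of the copies, else None."""
    if not copies:
        return None
    value, count = Counter(copies).most_common(1)[0]
    # A strict majority masks the faulty copies; otherwise report a fault.
    return value if count * 2 > len(copies) else None
```

With three stored copies, one corrupted copy is masked, while three different values are reported as a detected fault.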

Block D handles the transfer of the values to the ECUs. A range check of the output values is
made before they are sent to the ECUs, to ensure that they are within acceptable limits.

When the possibility of restarting a CCU is added, state-dependent information has to be taken
into account. Using duplex CCUs makes it possible to retain such information after a restart,
since the other CCU holds the correct values. If the primary CCU has to be restarted, the
secondary CCU continues execution and updates any state-dependent information. The startup
phase of a CCU therefore needs functionality for synchronizing such state-dependent
information.

At system startup, both CCUs shall assume default values. However, a safety mechanism is needed
to ensure that no CCU assumes default values while the system is actually running.
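
One conceivable way to separate the two startup cases is sketched below. It is purely illustrative: the thesis does not specify the mechanism at this point, and the function name, the `peer_state` argument and the contents of `DEFAULT_STATE` are hypothetical.

```python
# Hypothetical state-dependent information with its default values.
DEFAULT_STATE = {"state_item_1": 0, "state_item_2": 0}


def initial_state(peer_state):
    """Choose the state a starting CCU should assume.

    peer_state -- state reported by the other CCU, or None if the other
                  CCU is not running (the whole system is starting up).
    """
    if peer_state is None:
        return dict(DEFAULT_STATE)  # system startup: assume defaults
    return dict(peer_state)         # restart while running: copy the peer
```

The sketch assumes that a restarting CCU can reliably tell whether its peer is running, which is exactly the property the safety mechanism would have to guarantee.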

6 Safety validation

Safety validation is a part of ISO 26262, as shown in Figure 5.1 in section 5 of this thesis.
This section aims to verify the reliability and safety of the design developed in section 5,
hereinafter referred to as the fault-tolerant system.

Section 6.1 gives an overview of the reference design and the fault-tolerant design as fault
tree diagrams. Each system is then analysed to identify its single points of failure.

In section 6.2, the reliability of the two systems is compared. The analysis shows how the
reliability of each system changes over time. The systems are modelled using Markov chains.
The section also gives a detailed analysis of the reliability of the fault-tolerant system.

Section 6.3 analyses the safety of the fault-tolerant system. The steady-state safety is
calculated for each subsystem. Safety refers to the system’s ability to prevent failures that
can lead to a catastrophic event [19]. The steady-state safety is the probability that a
fail-safe state is reached, i.e., that a catastrophic failure is prevented.

6.1 Overview

The fault tree for the reference system is shown in Figure 6.1. It is derived from the
centralized architecture shown in Figure 5.2 and shows all the single points of failure in the
reference system. Engine and power failures are marked with a yellow triangle, which indicates
that there are underlying fault trees that need further investigation. This is, however, not
included in this thesis.

The fault tree for the fault-tolerant system is shown in Figure 6.2. It is derived from the
distributed architecture shown in Figure 5.3. The red triangle marked steering/brake failure
indicates that there is an underlying subsystem that may cause a failure. With steer-by-brakes,
failure of the steering system becomes dependent on the brake system. For example, if one of
the brake ECUs fails, steering can no longer operate in degraded mode. Failure of the steering
ECU and one brake ECU therefore puts the drive-by-wire system in critical failure. Since this
dependency cannot be represented in a fault tree, it is analysed using a Markov model in
section 6.2. The subsystem behind the red triangle is referred to as the brake-and-steer
system.

In addition to the steering and brake redundancy, the fault-tolerant design uses duplex CCUs,
which adds to the fault tolerance of the system. For example, if one of the CCUs fails, the
system is still functional, as shown in the fault tree in Figure 6.2. Improvements have also
been made to ensure the safety of the brake-and-steer system.

The subsystems for engine, power and sensor failures, shown in Figures 6.1 and 6.2, are not
further analysed in this thesis. They are mentioned to make the reader aware of other critical
parts of the drive-by-wire system.

Figure 6.1: Fault tree of the reference system. The yellow triangles indicate that there are
underlying fault trees.

Figure 6.2: Fault tree for the fault-tolerant system. The yellow triangles indicate that there are
underlying fault trees. The steering/brake failure is analysed in section 6.2.

6.2 System reliability

Reliability is the system’s ability to perform its intended function on demand without failure
[19]. Since the reference system does not have any degraded state, both systems are considered
to be functional until failed.

This section compares the reliability of the reference system and the fault-tolerant system
and shows how the reliability decreases over time for both systems. The comparison is made for
the central control unit (CCU) and the brake-and-steer system. Since the CCU and the
brake-and-steer system are independent of each other, each can be modelled separately using
continuous-time Markov models with exponentially distributed transition times [9]. A Q-matrix
is extracted from each Markov model and the ordinary differential equations (ODEs) are set up.
The ODEs are then solved using Laplace transforms and MATLAB to find the solution at a given
time.

The failure rates of the systems are estimated from the Safety Integrity Level (SIL) of the
standard IEC 1508 and the quantitative analysis in the ISO 26262 standard [4] (see section
3.4). The lifespan of a vehicle is estimated to be 10 years, based on the studies in [20] and
[21].

6.2.1 Central control unit

The design of the CCU in the reference system uses a simplex design. The Markov model is
shown in Figure 6.3. A failure of the CCU will result in a system failure.

Figure 6.3: Markov model of the central control unit (CCU) for the reference system. λCCU is
the failure rate for the CCU.

The Q-matrix extracted from the Markov model is shown in Equation 1a, and the ODE extracted
from the Q-matrix in Equation 1b. Equation 1c shows the Laplace transform of the ODE to the
s-domain; the expression is then transformed back to the time domain using the inverse Laplace
transform, which gives the probability of being in state A of Figure 6.3. The resulting
reliability of the reference system’s CCU is shown in Equation 1d.

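\[
Q = \begin{bmatrix}
-\lambda_{CCU} & \lambda_{CCU} \\
0 & 0
\end{bmatrix} \tag{1a}
\]

\[
P'_A(t) = -\lambda_{CCU} P_A(t) \tag{1b}
\]

\[
\mathcal{L} \Rightarrow P_A(s) = \frac{1}{s + \lambda_{CCU}}
\Rightarrow \mathcal{L}^{-1} \Rightarrow P_A(t) = e^{-\lambda_{CCU} t} \tag{1c}
\]

\[
R_{CCU} = P_A(t) \Leftrightarrow R_{CCU} = e^{-\lambda_{CCU} t} \tag{1d}
\]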
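
As a quick numerical illustration of Equation 1d, the sketch below evaluates the reliability of a simplex unit over the vehicle lifespan. The function name and the conversion of 8760 operating hours per year (continuous operation) are assumptions; the thesis does not state the conversion it uses.

```python
import math

HOURS_PER_YEAR = 8760  # assumption: continuous operation, 365 days per year


def simplex_reliability(failure_rate_per_hour, years):
    """Evaluate R(t) = exp(-lambda * t) for a single non-redundant unit."""
    t_hours = years * HOURS_PER_YEAR
    return math.exp(-failure_rate_per_hour * t_hours)


if __name__ == "__main__":
    r = simplex_reliability(1e-7, 10)
    print(f"R(10 years) = {100 * r:.2f} %")
```

Under these assumptions a failure rate of 10^-7 failures per hour gives a reliability of roughly 99.13% after 10 years.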

The fault-tolerant CCU design uses a duplex system; its Markov model is shown in Figure 6.4.
Each state is labelled with a combination of three numbers. The left number shows how many
CCUs are working in that state, the middle number how many CCUs are under repair (restarting),
and the right number how many CCUs are in a non-repairable state (permanently damaged).

The initial state is state A in Figure 6.4, in which both CCUs are functional. If a transient
or intermittent fault occurs and is covered by the system, a transition to state B occurs.
State B indicates that one CCU is working and one is under repair. The CCU under repair is
restarted by the hardware monitor, and a transition back to A occurs with the repair rate µr.
The repair rate is the inverse of the restart time of the CCU, which is estimated to be
approximately 1 ms. The system covers both transient and intermittent faults. Since a CCU is
repaired by a power-off restart, an intermittent fault will cause the CCU to be restarted
repeatedly until the fault disappears. The restart of the system is further elaborated in
section 7.

If another failure occurs in the working CCU while the first CCU is restarting, a transition
from state B to state D occurs. Since both CCUs have failed, the whole system is considered
failed. A CCU can also be exposed to a non-covered fault, in which case the system moves
directly from state A to state D, i.e., system failure. A non-covered fault is not detected by
the system, so the affected CCU is not restarted by its hardware monitor; the monitor can only
restart a CCU when a failure is detected. Any such undetected fault that alters the system
from its correct behaviour therefore leads to state D.

The CCUs can also be exposed to a permanent fault, i.e., a fault that is not repairable, e.g.,
a broken transistor in the CCU or a fault in the hardware monitor. If the permanent fault is
covered, a transition from state A to state C occurs. In state C one CCU is working and the
other has been detected as permanently damaged and shut off by the hardware monitor. There is
no repair rate in this state, since restarting the CCU cannot repair permanent damage. If
another fault occurs in the working CCU, a transition from state C to state D occurs.

In Figure 6.4, λt is the transient failure rate and λp the permanent failure rate of one CCU,
µr is the repair rate of one CCU, ct is the coverage for transient and intermittent faults,
and cp is the coverage for permanent faults.

Figure 6.4: Markov model of the central control units (CCU) for the fault-tolerant system.

Equation 2a shows the Q-matrix extracted from the Markov model of the CCUs in the
fault-tolerant system (Figure 6.4). Equation 2b shows the ODEs extracted from the Q-matrix.
The Markov model is stiff between states A and B, because the repair rate is very high
compared to the low rate of transient failures (including intermittent faults) [22]. A MATLAB
solver based on the numerical differentiation formulas [23] is therefore used to solve the
ODEs in Equation 2b and calculate the reliability in Equation 2c.

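\[
Q = \begin{bmatrix}
-2(\lambda_t + \lambda_p) & 2c_t\lambda_t & 2c_p\lambda_p & (1 - c_t)2\lambda_t + (1 - c_p)2\lambda_p \\
\mu_r & -(\mu_r + \lambda_t + \lambda_p) & 0 & \lambda_t + \lambda_p \\
0 & 0 & -(\lambda_t + \lambda_p) & \lambda_t + \lambda_p \\
0 & 0 & 0 & 0
\end{bmatrix} \tag{2a}
\]

\[
\begin{aligned}
P'_A(t) &= -2(\lambda_t + \lambda_p)P_A(t) + \mu_r P_B(t) \\
P'_B(t) &= 2c_t\lambda_t P_A(t) - (\mu_r + \lambda_t + \lambda_p)P_B(t) \\
P'_C(t) &= 2c_p\lambda_p P_A(t) - (\lambda_t + \lambda_p)P_C(t)
\end{aligned} \tag{2b}
\]

\[
R_{CCU} = P_A(t) + P_B(t) + P_C(t) \tag{2c}
\]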
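
The ODEs in Equation 2b can also be integrated with a simple implicit scheme. The sketch below uses backward Euler, which remains stable for stiff systems; it is not the MATLAB solver used in the thesis. The function name, the step size and the conversion of 8760 operating hours per year are assumptions. The repair rate in the example, 3.6 · 10^6 per hour, corresponds to the restart time of 1 ms.

```python
HOURS_PER_YEAR = 8760  # assumption: continuous operation


def duplex_reliability(lam_t, lam_p, mu_r, c_t, c_p, years, step_hours=10.0):
    """Integrate Equation 2b with backward Euler and return Equation 2c.

    All rates are per hour. Backward Euler is an illustrative substitute
    for the stiff MATLAB solver mentioned in the text.
    """
    h = step_hours
    lam = lam_t + lam_p
    # Coefficients of the implicit 2x2 system that couples states A and B.
    a11 = 1.0 + 2.0 * h * lam
    a12 = -h * mu_r
    a21 = -2.0 * h * c_t * lam_t
    a22 = 1.0 + h * (mu_r + lam)
    det = a11 * a22 - a12 * a21
    p_a, p_b, p_c = 1.0, 0.0, 0.0  # the system starts in state A
    steps = round(years * HOURS_PER_YEAR / h)
    for _ in range(steps):
        # Solve the 2x2 system for the new P_A and P_B (Cramer's rule).
        new_a = (a22 * p_a - a12 * p_b) / det
        new_b = (a11 * p_b - a21 * p_a) / det
        # State C depends only on the new P_A and its own outflow.
        p_c = (p_c + 2.0 * h * c_p * lam_p * new_a) / (1.0 + h * lam)
        p_a, p_b = new_a, new_b
    return p_a + p_b + p_c


if __name__ == "__main__":
    r = duplex_reliability(1e-7, 1e-8, 3.6e6, 1.0, 1.0, years=10)
    print(f"unreliability after 10 years: {1.0 - r:.2e}")
```

The states A and B are solved together in each step because the large repair rate couples them tightly, while state C can be updated afterwards from the new value of P_A.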

Since the CCU is an E/E system, the failure rate is estimated from the SIL according to the IEC
1508 standard [9][14] (see section 3.4 in this thesis). Table 6.1 shows example failure rates
for one CCU at the different ASILs. The failure rates are taken from Table 3.1 and
approximated to failures per hour. The estimated failure rates apply to transient and
intermittent failures.

Table 6.1: Example values for transient failure rates for different ASIL.

SIL (IEC 1508)   Failures per year      ASIL (ISO 26262)   Failures per hour
4                ≥ 10^-5 to < 10^-4     D                  ≥ 10^-9 to < 10^-8
3                ≥ 10^-4 to < 10^-3     C                  ≥ 10^-8 to < 10^-7
2                ≥ 10^-3 to < 10^-2     B                  ≥ 10^-7 to < 10^-6
1                ≥ 10^-2 to < 10^-1     A                  ≥ 10^-6 to < 10^-5
-                -                      QM                 None required
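
The per-hour column of Table 6.1 can be related to the per-year column by a unit conversion. The sketch below assumes 8760 operating hours per year (continuous operation), which the thesis does not state explicitly; the table rounds the result to the nearest power of ten.

```python
HOURS_PER_YEAR = 8760  # assumption: continuous operation, 365 days per year


def per_year_to_per_hour(rate_per_year):
    """Convert a failure rate from failures per year to failures per hour."""
    return rate_per_year / HOURS_PER_YEAR


if __name__ == "__main__":
    # Lower and upper bounds per year for SIL 4 down to SIL 1.
    bounds = [(4, 1e-5, 1e-4), (3, 1e-4, 1e-3), (2, 1e-3, 1e-2), (1, 1e-2, 1e-1)]
    for sil, low, high in bounds:
        print(f"SIL {sil}: {per_year_to_per_hour(low):.2e} to "
              f"{per_year_to_per_hour(high):.2e} failures per hour")
```

For SIL 4, the lower bound of 10^-5 failures per year corresponds to about 1.14 · 10^-9 failures per hour, which the table approximates as 10^-9.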

Table 6.2 shows a possible source for the derivation of random hardware failure rates, stated
in part 5 of the ISO 26262 standard [4]. It shows that the values in Table 6.1 are a good
estimate in relation to ISO 26262.

Table 6.2: Example values for random hardware failures values according to the ISO 26262 [4].

ASIL (ISO 26262)   Failures per hour
D                  < 10^-8
C                  < 10^-7
B                  < 10^-7

The ratio between permanent and transient failures (including intermittent failures) is
estimated from the quantitative analysis in part 10 of the ISO 26262 standard [4]. The
permanent failure rate is estimated to be 10 times lower than the transient failure rate. For
example, if the transient failure rate is estimated to be < 10^-8, the permanent failure rate
is < 10^-9. This applies to the CCU of both the reference system and the fault-tolerant
system. The coverage factor is assumed to be the same for permanent and transient failures of
the CCU.

The reliability extracted from Equation 2c, for the fault-tolerant system using duplex CCUs,
is compared to the reliability from Equation 1d, for the reference system using a simplex
CCU.

Figure 6.5 shows the comparison between the CCU of the reference system (dashed lines) and
that of the fault-tolerant system (continuous lines). The failure rates are given in failures
per hour and the coverage is considered ideal. The comparison shows that the reliability of
the fault-tolerant system is always higher than that of the reference system over the vehicle
lifespan.

[Plot: reliability (%) versus time (years), with curves for λt = 5·10^-7, 1·10^-7, 5·10^-8 and
1·10^-8.]

Figure 6.5: Reliability for the central control unit of the reference system (dashed line) and
fault-tolerant system (continuous line). The different failure rates are measured in failures per
hour and the coverage is considered ideal.

Figure 6.6 shows the reliability of the fault-tolerant system over the vehicle lifespan with
ideal coverage. The figure shows that failure rates lower than or equal to λt = 10^-7 failures
per hour meet a reliability of five nines (i.e., 99.999%) over 10 years.

To see how the coverage of the system affects the reliability, the analysis is repeated with a
decreased coverage factor. Figure 6.7 shows the reliability of the system with a coverage
factor of 99%. Failure rates lower than or equal to λt = 5 · 10^-9 failures per hour fulfil
five-nines reliability over 10 years.

Figure 6.8 shows the reliability of the system with a coverage factor of 98%. The only failure
rate shown that meets five-nines reliability over ten years is λt = 10^-9 failures per hour.

Figure 6.9 shows the reliability of the system with a coverage factor of 95%. At this coverage
the system reaches its limit with a failure rate of λt = 10^-9 failures per hour. If the
coverage is lowered further, five-nines reliability cannot be achieved over the 10-year
lifespan of a vehicle.
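
The effect of the coverage factor can be checked roughly with a first-order estimate that considers only the direct transition from state A to state D in Equation 2a, i.e., 1 − R ≈ 2(1 − c)(λt + λp)t. This is a simplification introduced here, not a calculation from the thesis: it ignores the paths through states B and C, assumes the same coverage c for both fault types, takes λp = λt/10 as stated earlier, and assumes 8760 operating hours per year. The function names are hypothetical.

```python
HOURS_PER_YEAR = 8760  # assumption: continuous operation


def uncovered_unreliability(lam_t, coverage, years, perm_ratio=0.1):
    """First-order estimate of 1 - R from the direct A -> D transition only."""
    lam_p = perm_ratio * lam_t          # permanent rate is 10 times lower
    t_hours = years * HOURS_PER_YEAR
    return 2.0 * (1.0 - coverage) * (lam_t + lam_p) * t_hours


def meets_five_nines(lam_t, coverage, years=10):
    """True if the estimated unreliability stays below 10^-5."""
    return uncovered_unreliability(lam_t, coverage, years) < 1e-5
```

Under these assumptions the estimate appears consistent with the figures: at 99% coverage λt = 5 · 10^-9 passes while λt = 10^-8 does not, and at 95% coverage λt = 10^-9 passes only narrowly.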

[Plot: reliability (%) versus time (years), with curves for λt = 5·10^-7, 1·10^-7, 5·10^-8,
1·10^-8, 5·10^-9 and 1·10^-9.]

Figure 6.6: Reliability for the fault-tolerant system’s central control unit using ideal
coverage. The different failure rates are measured in failures per hour.

[Plot: reliability (%) versus time (years), with curves for λt = 5·10^-7, 1·10^-7, 5·10^-8,
1·10^-8, 5·10^-9 and 1·10^-9.]

Figure 6.7: Reliability for the fault-tolerant system’s central control unit using a coverage
of 99%. The different failure rates are measured in failures per hour.

[Plot: reliability (%) versus time (years), with curves for λt = 5·10^-7, 1·10^-7, 5·10^-8,
1·10^-8, 5·10^-9 and 1·10^-9.]

Figure 6.8: Reliability for the fault-tolerant system’s central control unit using a coverage
of 98%. The different failure rates are measured in failures per hour.

[Plot: reliability (%) versus time (years), with curves for λt = 5·10^-7, 1·10^-7, 5·10^-8,
1·10^-8, 5·10^-9 and 1·10^-9.]

Figure 6.9: Reliability for the fault-tolerant system’s central control unit using a coverage
of 95%. The different failure rates are measured in failures per hour.

6.2.2 Brake-and-steer system

The reference system has two independent systems for the brake and steering ECUs, as seen in
the fault tree in Figure 6.1. The Markov model for the reference system is shown in Figure
6.10. The system starts in state A. A failure in the brake system results in a brake failure
(state B), whereas a failure in the steer system results in a steering failure (state C).

�����

����	��


��

�������

����	��


��

��������


��

� �

�
�

Figure 6.10: Markov model of the brake-and-steer system for the reference design. λB is the
failure rate for one brake ECU and λS for the primary steering unit.

The Q-matrix of the Markov model in Figure 6.10 is given in Equation 3a, and the ODE extracted
from the Q-matrix in Equation 3b. Equation 3c solves the ODE by taking the Laplace transform and
then the inverse Laplace transform, which gives the probability of being in state A. The resulting
reliability of the reference brake-and-steer system is given in Equation 3d.

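$$
Q = \begin{bmatrix}
-(\lambda_B + \lambda_S) & \lambda_B & \lambda_S \\
0 & 0 & 0 \\
0 & 0 & 0
\end{bmatrix} \tag{3a}
$$

$$
P'_A(t) = -(\lambda_B + \lambda_S) P_A(t) \tag{3b}
$$

$$
\mathcal{L} \Rightarrow P_A(s) = \frac{1}{s + (\lambda_B + \lambda_S)}
\Rightarrow \mathcal{L}^{-1} \Rightarrow P_A(t) = e^{-(\lambda_B + \lambda_S)t} \tag{3c}
$$

$$
R_{BS} = P_A(t) \Leftrightarrow R_{BS} = e^{-(\lambda_B + \lambda_S)t} \tag{3d}
$$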
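
As a numerical illustration of Equations 3b to 3d, the following Python sketch evaluates the
closed-form reliability and cross-checks it against a numerical integration of the ODE. The
failure rates (10^-8 failures per hour each), the assumption of continuous operation (8760 hours
per year) and all function names are illustrative assumptions and are not taken from the thesis.

```python
import math

HOURS_PER_YEAR = 365 * 24  # assumption: the vehicle is treated as operating continuously


def reliability_closed_form(lambda_b, lambda_s, t_hours):
    """Equation 3d: R_BS(t) = exp(-(lambda_B + lambda_S) * t)."""
    return math.exp(-(lambda_b + lambda_s) * t_hours)


def reliability_numeric(lambda_b, lambda_s, t_hours, steps=10_000):
    """Integrate Equation 3b, P_A'(t) = -(lambda_B + lambda_S) * P_A(t), with classical RK4."""
    rate = lambda_b + lambda_s
    h = t_hours / steps
    p_a = 1.0  # the system starts in state A
    for _ in range(steps):
        k1 = -rate * p_a
        k2 = -rate * (p_a + 0.5 * h * k1)
        k3 = -rate * (p_a + 0.5 * h * k2)
        k4 = -rate * (p_a + h * k3)
        p_a += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6
    return p_a


def max_total_failure_rate(target_reliability, t_hours):
    """Largest lambda_B + lambda_S for which Equation 3d still meets the target."""
    return -math.log(target_reliability) / t_hours


if __name__ == "__main__":
    lambda_b = lambda_s = 1e-8  # hypothetical failure rates, failures per hour
    t = 10 * HOURS_PER_YEAR
    print("closed form:", reliability_closed_form(lambda_b, lambda_s, t))
    print("numeric    :", reliability_numeric(lambda_b, lambda_s, t))
    print("rate limit for five nines:", max_total_failure_rate(0.99999, t))
```

Under these assumptions, `max_total_failure_rate` indicates that five-nines reliability over 10
years would require a combined rate λB + λS of roughly 1.1 * 10^-10 failures per hour, since the
reference system has no redundancy and any single failure leaves state A.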

The fault-tolerant brake-and-steer system uses a distributed brake architecture, which increases
safety for both steering and brakes. The brake system remains operational after the failure of one
distributed brake ECU. The steer system also gains redundancy: if the steer ECU fails, the
steer-by-brakes function acts as a backup using the distributed brakes (both brake ECUs need to
be functional).

The Markov model of the fault-tolerant brake-and-steer system is shown in Figure 6.11. The system
starts in state A, in which both brakes and steering are at full functionality. The failure rate
of a brake ECU is denoted λB and the coverage of the brake system cB. Since there are two brake
ECUs in the vehicle, the transition rate from state A to B is twice the failure rate of a single
brake ECU. The failure rate of the steer system is denoted λs and its coverage cs. When one brake
ECU fails, the system moves from state A to state B, in which the system is still operational but
the brakes work in degraded mode. If the other brake ECU fails while in state B, the system can no
longer provide brake functionality for the vehicle, and a transition from state B to D occurs,
i.e., the brake system has failed. The steer-by-brakes functionality depends on both brake ECUs
being functional; otherwise it cannot be guaranteed. Therefore, if a steering failure occurs in
state B, the system moves from state B to E, where state E means that the steering has failed.
When the vehicle is at full functionality (state A) and suffers a steering failure, a transition
from state A to C occurs. In state C the steering remains operational through the steer-by-brakes
functionality. If one brake ECU then fails, a transition from state C to E takes place, i.e.,
steering failure. The system may also be exposed to non-covered faults. This is shown in Figure
6.11 by the transitions that go directly from state A to a failure state (D or E). A non-covered
failure is, for example, an undetected fault in the brake or steering ECU that propagates to a
system failure.

Figure 6.11: Markov model of the brake-and-steer system for the fault-tolerant design.
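
The transitions described for Figure 6.11 can be collected into a generator matrix and integrated
numerically. The Python sketch below is one reading of the verbal description and is not the
thesis's Equation 4a: the coverage factors on the transitions out of state A, the factor of two on
the transition from C to E, the choice to count states A, B and C as operational, and all
numerical values and function names are assumptions.

```python
import math

# A: full functionality, B: degraded brakes, C: steer-by-brakes, D: brake failure, E: steering failure
STATES = ["A", "B", "C", "D", "E"]


def build_q(lambda_b, lambda_s, c_b, c_s):
    """Generator matrix as nested dicts: q[src][dst] is a rate in failures per hour."""
    q = {s: {u: 0.0 for u in STATES} for s in STATES}
    q["A"]["B"] = 2 * c_b * lambda_b        # covered failure of one of the two brake ECUs
    q["A"]["C"] = c_s * lambda_s            # covered steering failure, steer-by-brakes takes over
    q["A"]["D"] = 2 * (1 - c_b) * lambda_b  # non-covered brake fault (assumed form)
    q["A"]["E"] = (1 - c_s) * lambda_s      # non-covered steering fault (assumed form)
    q["B"]["D"] = lambda_b                  # the remaining brake ECU fails
    q["B"]["E"] = lambda_s                  # steering fails while steer-by-brakes is unavailable
    q["C"]["E"] = 2 * lambda_b              # either brake ECU fails (assumed factor of two)
    for s in STATES:
        q[s][s] = -sum(q[s][u] for u in STATES if u != s)  # each row sums to zero
    return q


def derivative(p, q):
    """Right-hand side of the Kolmogorov forward equations, P'(t) = P(t) Q."""
    return {u: sum(p[s] * q[s][u] for s in STATES) for u in STATES}


def state_probabilities(q, t_hours, steps=2000):
    """Integrate P'(t) = P(t) Q with classical RK4, starting in state A."""
    h = t_hours / steps
    p = {s: 0.0 for s in STATES}
    p["A"] = 1.0
    for _ in range(steps):
        k1 = derivative(p, q)
        k2 = derivative({s: p[s] + 0.5 * h * k1[s] for s in STATES}, q)
        k3 = derivative({s: p[s] + 0.5 * h * k2[s] for s in STATES}, q)
        k4 = derivative({s: p[s] + h * k3[s] for s in STATES}, q)
        p = {s: p[s] + h * (k1[s] + 2 * k2[s] + 2 * k3[s] + k4[s]) / 6 for s in STATES}
    return p


def reliability(p):
    """Probability of being in an operational state (A, B or C; an assumption of this sketch)."""
    return p["A"] + p["B"] + p["C"]


def p_a_closed_form(lambda_b, lambda_s, t_hours):
    """Total outflow from A is 2*lambda_B + lambda_s whatever the coverage, so P_A decays exponentially."""
    return math.exp(-(2 * lambda_b + lambda_s) * t_hours)


if __name__ == "__main__":
    t = 10 * 365 * 24  # 10 years of continuous operation, in hours (assumption)
    for c in (1.0, 0.99, 0.95):
        p = state_probabilities(build_q(1e-8, 1e-8, c, c), t)  # hypothetical rates
        print(f"coverage {c:.2f}: R = {reliability(p):.8f}")
```

In this reading, the coverage only redistributes the outflow from state A between the degraded
states and the failure states, which is why `p_a_closed_form` does not depend on cB or cs while
the reliability does.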

The Q-matrix extracted from the Markov model is shown in Equation 4a. The extracted ODE
from the Q-matrix is shown in Equation 4b. Equation