Digital Forensic Investigation of Automotive Systems: Requirements and Challenges Master’s thesis in Computer science and engineering Yitao Dong Jun Zhang Department of Computer Science and Engineering CHALMERS UNIVERSITY OF TECHNOLOGY UNIVERSITY OF GOTHENBURG Gothenburg, Sweden 2023 Master’s thesis 2023 Digital Forensic Investigation of Automotive Systems: Requirements and Challenges Yitao Dong Jun Zhang Department of Computer Science and Engineering Chalmers University of Technology University of Gothenburg Gothenburg, Sweden 2023 Digital Forensic Investigation of Automotive Systems: Requirements and Challenges Yitao Dong and Jun Zhang © Yitao Dong and Jun Zhang, 2023. Supervisor: Kim Strandberg, Volvo Cars Examiner: Tomas Olovsson, Chalmers Master’s Thesis 2023 Department of Computer Science and Engineering Chalmers University of Technology and University of Gothenburg SE-412 96 Gothenburg Telephone +46 31 772 1000 Typeset in LATEX Gothenburg, Sweden 2023 iv Digital Forensic Investigation of Automotive Systems: Requirements and Challenges Yitao Dong and Jun Zhang Department of Computer Science and Engineering Chalmers University of Technology and University of Gothenburg Abstract With the increasing complexity of vehicle architecture and interconnection between the vehicle and other entities, many problems arise, such as cyber attacks. Accidents can be caused by hardware or software failures, intentional or unintentional incidents. Automotive Digital Forensics (ADF) is used for determine the cause of accidents and usually includes processes like data collection, data analysis, data preservation and documentation. ADF is a relatively new field and is not well-researched. The lack of common and unified industry guidelines and standards makes ADF challenging. We have investigated previous work within the automotive and similar areas, with the aim of identifying parts applicable to ADF, such as forensic mechanisms, guidelines, standards, fulfilment of security properties, and data extraction and verification. Furthermore, we propose a framework that considers the entire life cycle of ADF. Keywords: Automotive digital forensics, ADF, IoT, cyber security, V2X communica- tion, forensics guidelines, forensics model. v Acknowledgements We would like to express the most sincere gratitude to our supervisor, Kim Strandberg. During we writing the thesis, Kim helped us a lot on our work - answering our questions, meeting with us regularly and revising the thesis. With his assistance, we were able to finish our thesis smoothly. We are also grateful to our examiner Tomas Olovsson, who has given us invaluable inspiration and encouragement to complete this meaningful work. Furthermore, many thanks to the Cyber Security team at Volvo Cars for their kindness and friendly support during the writing of the thesis. Yitao Dong and Jun Zhang, Gothenburg, 2023-06-30 vii Contents List of Figures xi List of Tables xiii List of Acronyms xv 1 Introduction 1 1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2.1 Automotive digital forensics . . . . . . . . . . . . . . . . . . . 2 1.2.2 Hardware and protocols . . . . . . . . . . . . . . . . . . . . . 2 1.2.3 Regulations and standards . . . . . . . . . . . . . . . . . . . . 4 1.3 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.4 Purpose . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.5 Thesis outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2 Related work 7 3 Methods 11 3.1 Literature review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 3.2 Similar areas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 3.2.1 IoT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 3.2.2 Avionics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 3.2.3 Railway . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 3.2.4 Smart cities . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 3.3 Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 3.3.1 Blockchain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 3.3.2 Cyber-investigation Analysis Standard Expression . . . . . . . 15 3.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 3.4.1 TRLs of technical solutions . . . . . . . . . . . . . . . . . . . 18 3.4.2 Gaps analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 4 Results 21 4.1 Forensic lifecycle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 4.2 Data collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 4.2.1 Preparations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 ix Contents 4.2.2 Feasibility analysis . . . . . . . . . . . . . . . . . . . . . . . . 26 4.2.3 Gathering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 4.3 Data preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 4.3.1 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 4.3.2 Format unification . . . . . . . . . . . . . . . . . . . . . . . . 29 4.4 Data preservation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 4.5 Data retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 4.5.1 Block erasing . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 4.5.2 Block accessing . . . . . . . . . . . . . . . . . . . . . . . . . . 35 4.6 Data analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 4.7 Documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 4.7.1 Results collection . . . . . . . . . . . . . . . . . . . . . . . . . 38 4.7.2 Reporting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 5 Discussion 41 5.1 Innovation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 5.2 Limitation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 5.3 Ethics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 5.4 Attacker model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 6 Conclusion 47 6.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 6.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 Bibliography 49 x List of Figures 3.1 Standard CAN message frame . . . . . . . . . . . . . . . . . . . . . . 15 3.2 Extended CAN message frame . . . . . . . . . . . . . . . . . . . . . . 15 3.3 LIN message frame . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 4.1 Vehicle blockchain network architecture . . . . . . . . . . . . . . . . . 22 4.2 ADF framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 4.3 Data classify strategies . . . . . . . . . . . . . . . . . . . . . . . . . . 29 4.4 Procedures of data preservation . . . . . . . . . . . . . . . . . . . . . 35 4.5 Data analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 4.6 Results document form . . . . . . . . . . . . . . . . . . . . . . . . . . 39 xi List of Figures xii List of Tables 3.1 TRLs of technique solutions . . . . . . . . . . . . . . . . . . . . . . . 19 4.1 ADF lifecycle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 4.2 Tools and support file systems . . . . . . . . . . . . . . . . . . . . . . 25 4.3 Comparison of three forensic tools . . . . . . . . . . . . . . . . . . . . 26 4.4 Characteristics and costs of extraction methods . . . . . . . . . . . . 27 4.5 CASE objects and descriptions . . . . . . . . . . . . . . . . . . . . . 30 4.6 Comparison of permissioned and permissionless blockchain . . . . . . 34 xiii List of Tables xiv List of Acronyms DF Digital Forensics ADF Automotive Digital Forensics V2X Vehicle-to-Everything IVN In-Vehicle Network ECU Electronic Control Unit EDR Event Data Recorder OBD-II On-Board Diagnostics-II JTAG Joint Test Action Group CAN Controller Area Network LIN Local Interconnect Network MOST Media Oriented System Transport GDPR General Data Protection Regulation VANETs Vehicular Ad-Hoc Networks DLC Diagnostic Link Connector UAVs Unmanned Aerial Vehicles JSON-LD JavaScript Object Notation for Linked Data RSU Roadside Units CASE Cyber-investigation Analysis Standard Expression TRLs Technology Readiness Levels UUID Universally Unique Identifier AL Applicability Level VIN Vehicle Identification Number VPKI Vehicular Public Key Infrastructure xv List of Tables xvi 1 Introduction According to Cambridge dictionary, the word "forensics" refers to the scientific methods of solving crimes, that involve examining objects or substances related to a crime [1]. From this definition, we conclude that the most important step of forensics is examining objects, i.e., collecting evidences, and finally use them to establish the crime timeline. During the traditional forensic process, investigators collect physical or biological evidences (e.g., fingerprints and DNA) with the aim to identify the suspect. Unlike traditional forensics, Digital Forensics (DF) emphasizes on collecting digital evidences such as the data in memory, hard drives or cloud. We mainly discuss Automotive Digital Forensics (ADF) in this thesis, which is a branch of digital forensics. This chapter introduces the background of our work, the challenges that currently exist and our purpose. 1.1 Background In recent years, the complexity of vehicle architecture has significantly increased, as well as the interconnection between vehicles and other entities, i.e., Vehicle-to- Everything (V2X) communication. The rapid development of automotive technology increases safety and makes autonomous driving come true, but it also raises many issues, such as cyber attacks. Accidents can be caused by hardware or software failures, intentional or non-intentional incidents such as caused by a distracted driver. To determine the cause, we need to collect and analyse relevant data generated by the vehicle. A vehicle has a lot of data originating from the In-Vehicle Network (IVN) and its communication with the outside world via V2X. Currently a very limited amount of data is stored in the memory of Electronic Control Unit (ECU)s and the cloud. For a vehicle, to support forensic investigation, much more data need to be stored securely. However, ADF is a relatively new area and is not well-researched. In a systematic literature review over the area of ADF [2], Strandberg et al. mention var- ious challenges. For instance, the lack of common and necessary industry guidelines and standards has made ADF difficult. Furthermore, no standardized data formats or interfaces exist in vehicles, which makes data collection and extraction difficult. 1 1. Introduction 1.2 Concepts Digital forensics has been investigated by many scholars [3], [4], [5]. Still, ADF is relatively a new area compared to DF because of its characteristics and distinct challenges. To familiarize readers, we are going to introduce some concepts below. 1.2.1 Automotive digital forensics ADF is defined as a branch of digital forensics relating to recovery of potential evidence stored in automotive modules, networks and messages sent across operating systems [6]. There is a large volume of data exchange while the vehicle is running. Today’s vehicles are equipped with multiple sensors, such as GPS, cameras and radar. The ECUs handle the input from sensors, whereafter signals are sent to actuators to respond with different actions like braking, steering or acceleration. It is crucial that ADF evidences should not be altered. Thus, integrity is imperative. In general, a "CIANP" model is presented in [2] to meet the security requirement for forensics evidence data, where the capital letter are abbreviations for Confidentiality, Integrity, Availability, Non-repudiation and Privacy. In ADF, specific steps of carry- ing out forensics may vary between different models, but it typically contains four steps: data collection, data analysis, data preservation and documentation. In the data collection phase, data is collected from various data sources existing in, e.g., IVN, V2X and the cloud, whereafter the data is filtered based on specific criteria concerning the investigated crime. The next step is to analyse the data to determine if it can be potential evidence. Automated machine-learning based approaches can be used to filter out forensically relevant information from the collected data [7]. Once the evidences are identified, the next phase aims at preserving evidences in a forensically sound manner, i.e., preserving the integrity of evidence [8]. In the last phase, a final report is generated from the results of the previous phases. Finally, the documentation can be used and presented in relation to a crime in the court of law. From a forensic perspective, the "CIANP" model is interpreted as follows. Confi- dentiality means that data shall only be accessed by authorized entities. Integrity ensures that the data is not tampered within the forensics process. Availability means that data remains available even in the event of a crash or other unexpected situations. Non-repudiation refers to the property that the occurrence of a certain event and its origin cannot be denied [2]. User’s personal data such as GPS location, call logs and contacts are highly sensitive from a privacy perspective. The privacy property ensures that the data is well protected and not disclosed to unauthorized individuals. When the security properties mentioned above are fulfilled, the data can be trusted to be authentic in a court of law. 1.2.2 Hardware and protocols Modern vehicles can have over 150 ECUs and contain more than 100M lines of code [2]. Automotive hardware include sensors, actuators and ECUs. Software consists of 2 1. Introduction applications used to implement automobile’s functionalities, which can be installed and running in ECUs. Typically, the in-vehicle software provides functionalities such as braking and interaction with the infotainment system. Memory chips in ECUs are ideal data sources for ADF. Besides ECUs, the infotainment system is of particular interest for investigators to examine. The infotainment system delivers information and entertainment via the dashboard touchscreen [9], which contains forensically useful information such as synchronized data from paired devices. Another device that contains useful data is the Event Data Recorder (EDR) [9]. In previous research about data storage investigation in vehicles and significance analysis, the EDR data was emphasized [10]. EDR, sometimes referred to as an automotive black box, is triggered by an event and used to record information related to crashes and accidents in a tamper-proof manner [8]. In order to access the data stored in the hardware for debugging and forensics pur- poses, interfaces are used. In the automotive case, On-Board Diagnostics-II (OBD-II), Joint Test Action Group (JTAG), USB, WiFi, Bluetooth are examples of existing interfaces. OBD-II, has been introduced to check the real-time parameter of all the electronic control units [11], and to record any abnormal behavior or malfunction in the system components [12]. In vehicles, there are mainly four kinds of communication buses: Controller Area Network (CAN), Local Interconnect Network (LIN), FlexRay and Media Oriented System Transport (MOST). Their functionalities are described as follows: 1) CAN The most common bus for data exchange and diagnostics. 2) LIN Used for low-speed and bandwidth applications, e.g., doors and sliding windows up and down. 3) FlexRay Used for safety critical and high-speed messages, e.g., vehicle stability control and embedded sensors. 4) MOST Used for high-speed and bandwidth multimedia related applications, e.g., mu- sic/video streaming and vehicle cameras. Lacroix et al. [13] visualize the vehicle architecture and mention that CAN is most important since it is the backbone network in a vehicle. It forwards all traffic that needs to be relayed between the sub-networks, and is an essential source for ADF since it can contain relevant error messages. However, it is a low level protocol and does not support security features such as encryption. Thus, most applications use their own security mechanisms [14]. 3 1. Introduction 1.2.3 Regulations and standards In general, ADF goes hand in hand with automotive cybersecurity, since securing the process of evidence data handling is of significant importance. There are several regulations and standards related to ADF both in Europe and around the world, which we will briefly introduce below. ISO/IEC 27037:2012, provides guidelines for specific activities in the handling of digital evidence, which are identification, collection, acquisition and preservation of potential digital evidence that can be of evidential value. Digital evidence should be obtained with an acceptable method with maintained integrity. This standard also provides general collection guidelines for physical evidence which could be helpful for managing the digital evidence [15]. The ISO/SAE 21434:2021 "Road Vehicles - Cybersecurity Engineering", jointly developed by SAE and ISO, is a guideline for secure automotive software and hard- ware development, and a standard for cybersecurity in the automotive industry. It enables organizations to define cybersecurity policies and processes, manage cyberse- curity risks, and foster a cybersecurity culture [16]. UN R.155, covers the uniform requirements of automotive cybersecurity and cyberse- curity management systems. This regulation is closely mapped to the requirements laid out in ISO 21434. General Data Protection Regulation (GDPR), applicable as of May 25th, 2018 in all member states to harmonize data privacy laws across Europe [17]. It imposes obligations onto organizations or individuals when they collect personal data in EU. 1.3 Challenges ADF is still an immature area. Although there are some technical solutions and surveys in ADF, many challenges still exist and need to be addressed, e.g., in the following categories [2]: • Evidence data management, referring to such as data collection, data extraction, data storage. Lack of a dedicated device to store forensic data is one of the challenges. • Communication, referring to data transferred to the cloud, edge, fog or related to Vehicular Ad-Hoc Networks (VANETs), where bandwidth can be a problem because of the huge data volume. • Algorithms, including machine learning. Although machine learning algorithms produce good results in data processing, they take considerable computation power. • Software and hardware used in forensics, such as forensic tools and vehicle architec- ture design, sensors, EDR. However, many tools or software have their own data formats, and it is a great trouble to use multiple tools from different companies. 4 1. Introduction • Cryptography, including blockchain. Although encrypting data improves data security level, it is still a trade-off between security and time. • General solutions, such as forensic process and guidelines, but the guidelines are not standardized. The challenges we face in ADF are broad, and in our research the following ones are highlighted: • When producing vehicles, many manufacturers prioritize usability and cost rather than fulfilling security properties, which increases the risk of cyber attacks. • The increasing complexity of vehicle architecture exposes vehicles to more attacks. • Increase in data volumes. Huge data volume could be generated both inside and outside of the vehicle such as data originating from the infotainment system and via communication with the cloud. This makes data storage, extraction and management difficult. • Various data formats and interfaces. Lack of standardized data formats and interfaces in vehicles make the evidence data collection and analysis difficult because of the low performance. • Security properties. The integrity and reliability of ADF data need to be ensured, i.e., the data must be authentic and tamper-proof. • Privacy concerns. Data from cameras, recorders and GPS may involve privacy issues. 1.4 Purpose Our ultimate goal is to provide guidelines and mechanisms for ADF. However, we put our emphasis on identifying the requirements for ADF at first, and then list focus categories corresponding to these requirements as follows: • Forensic mechanisms. Investigate what types of mechanisms exist in current vehicles regarding forensics investigations, including a potential standardized forensic guideline or process, and consider existing gaps for a forensic model. • Similar areas. Investigate similar areas that might apply to automotive, such as IoT, avionics, trains and smart cities. • Security properties. For example, by what means we can ensure the authenticity and admissibility of the collected and stored data. • Data extraction. Discuss and evaluate the challenges of transforming the collected evidence into data that a human investigator can interpret. Additionally, consider what measures may potentially influence the data (e.g., removing the power supply may result in losing volatile data). • Data verification. Once data is extracted, investigate how to prove the data is authentic so it can be admissible in the court of law. 5 1. Introduction Therefore, we consider the following countermeasures regarding the aforementioned challenges. For instance, • Investigate what has been done in ADF and other areas, to determine a general guideline. • Given existing solutions, identify what is currently lacking concerning ADF. • Propose an approach to ensure the CIANP properties. 1.5 Thesis outline The rest of this thesis is organized as follows. We discuss related works in Chapter 2. In Chapter 3, we present the methods used when conducting the research. Following that in Chapter 4, we demonstrate the detailed steps, i.e., the results of our forensic model. In Chapter 5, we discuss various issues such as innovations, limitations and ethical concerns. In the last Chapter 6 we present conclusions and finally ends with future work. 6 2 Related work Although ADF is still an immature area, some technical solutions exist. In this section, related work presented and classified into four categories, whereafter the gaps are analyzed. 1) Guideline As mentioned in [3], [4], [18], one challenge is that there is no general foren- sics framework or guideline. Regarding this problem, Buquerin et al. propose a generalized approach for ADF [5]. Their solution comprises four steps: Forensic Readiness, Data Acquisition, Data Analysis and Documentation. In the first step, they identify available data sources, tools and data extraction techniques. Then they choose a specific tool and interface to implement the data extraction process. At the end of the second step, the data acquisition, the extracted data need to be duplicated and the original data should be stored in a tamper-proof way. All later actions are performed on the duplicated data set. In the third step, the data analysis, the most relevant data are filtered out and the investigators can establish the evidence chain and crime timeline based on the data. Finally, in the forth and last step, they document all the previous results and create a final report. An example is given at the end of the article, which proves their approach to be feasible. Altschaffel et al. [19] propose another ADF model that has six steps: Strategic Preparation (SP), Operational Preparation (OP), Data Gathering (DG), Data Investigation (DI), Data Analysis (DA), Documentation (DO). The SP step refers to the forensic preparations done before accidents happening, while OP are foren- sic preparations after accidents happening. The rest steps are similar to those mentioned above. Moreover, they divide the forensic process into two categories: live forensic and post-mortem forensics. Live forensic focuses on extracting volatile data (e.g., data in main memory), while post-mortem forensics is carried out when the system is in power-off mode, which allows investigators to retrieve data on less volatile storage like hard disk. In [20], Sharma et al. show that ADF can be performed in two ways: reac- tive and proactive. The reactive approach is analogous to post-mortem forensics. The proactive approach consists of five phases: Proactive Collection, Proactive Preservation, Proactive Event Detection, Proactive Analysis and Report. In the first phase, data are collected using live forensics based on volatility and priority. 7 2. Related work Then data is preserved automatically. If there are any suspicious events detected in the third phase, they are analyzed and reported. Proactive forensics enable devices to record data prior to the accidents so that investigators are able to quickly determine the cause post-incident. 2) Architecture Davi et al. [21] describe an architecture for autonomous cars using blockchain technology. Traditional blockchain ensures data is accessible by authorized third parties, but does not guarantee integrity. This approach implements an in-vehicle shared ledger architecture to ensure data integrity, where each ECU works as a miner and shares information with all other ECUs. When a transaction (safety and security related message) is made, the ECU signs and broadcasts it to all ECUs. The receiver first verifies the signature, and propose a new block if a certain threshold of transactions is reached. In case that the verification fails, the transaction will not be processed further. Finally all ECUs update their copies of the chain by appending the new block. Their approach is helpful for ADF but fails to meet the real-time requirement of a safety-critical system. Lacroix et al. [13] introduce vehicle hardware and architecture in a compre- hensive way. Firstly, the authors introduce different buses (CAN, LIN, FlexRay and MOST) and state that CAN is the core bus that links all buses together. Then infotainment system is discussed. Infotainment systems like Ford SYNC, BMW Assist, Lexus Enform are interfaces to the end users. They provide safety-related (lock/unlock the door) or entertainment (streaming services) functionalities, thus containing a lot of useful information from a forensic perspective. Next, the au- thors present challenges of ADF, including mobility, topology changing, unreliable channels and multi-hop communication issues. Finally, they give an example of Ford SYNC physical dumps and analyze the information it contains. 3) Software and service Researchers have presented several applications used for ADF evidence data management and communication, for example in [22] and [23]. These two solu- tions are both based on the use of a 3-axis accelerometer together with other in-vehicle devices and modules, to track the vehicle location, detect accidents and provide the road condition updates. When the accelerometer’s G-value (centre of gravity) on the three coordinate varies, it is denoted that the vehicle faces sudden change in the acceleration. A tracking system combines the smart phone application with microcontroller which embedded with an acceleration sensing module is developed in [22]. Based on the information of G-value changes, road condition detection and vehicle location tracking can be implemented. In [23], a wireless black box using MEMS accelerometer and GPS tracking system is developed for accident monitoring. Additionally, this application can send out emergency messages to appropriate recipients when accidents happen. However, both solutions suffer from a lack of security and privacy considerations, which is critical for ADF. 8 2. Related work As investigated in [4], [13], [24], there are several forensic tools for digital forensics, such as Encase, Accessdata Forensic ToolKit, Xways Forensic, etc, but not many choices on ADF. From a previous survey in [4], there are very few ADF-specific tools, Bosch CDR and Berla iVe are two of them. The Bosch CDR Diagnostic Link Connector (DLC) Base Kit is an entry level kit which includes most components needed to retrieve EDR data directly from the DLC of many vehicles [25]. The Berla iVe Ecosystem is a collection of tools that supports investigators through- out the entire vehicle forensics process with a mobile application for identifying vehicles, a hardware kit for acquiring systems, and forensic software for analyzing data [26]. Besides the applications and tools, there are other choices, e.g., professional commercial companies which provide ADF services, like Digitpol [27] and Envista Forensics [28]. Digitpol’s services for ADF include: investigating infotainment, GPS and command systems, and identifying, capturing and analysing critical evidence stored in embedded OEM systems. Envista Forensics experts in data recovery, extraction and investigation from the infotainment system. 4) Data extraction A common challenge for ADF is that there is no standardized data format and interfaces, which makes data extraction and analysis difficult. Sladovic et al. state three different ways to extract data from vehicle’s internal system: connecting to the OBD-II port, umbilical-to-ECU and umbilical-to-EPROM [29]. The word "umbilical" means using a cable to directly connect to a device. Connecting to the OBD-II port is a straightforward method, but it is worth noting that the data retrieved here are DPID (Data Packet Identification Number) rather than actual data. Besides DPID, logs and fault codes are also accessible via OBD-II. For data security reasons, there is a special mode and security mechanism to gain extended privileges that enables retrieving relevant data. For the second method, the investigator connects directly to the ECU with a cable. However, if not operated correctly, the ECU would alter or even wipe the data for protection purposes. Thus it is not forensically sound and data integrity is not guaranteed. Umbilical-to-EPROM requires physical disassembly of the printed circuit board (PCB). It enables investigators to retrieve raw binary data in hexadecimal format but is time-consuming. Having discussed the related work as above, we conclude the following gaps. First of all, although several ADF frameworks or guidelines have been presented, they look similar to some extent but are still not unified. The lack of standardized guidelines in automotive industry is a problem. Secondly, no dedicated device storing forensic data is an issue. Furthermore, issues of data integrity have not been well addressed. Interfaces like WiFi and Bluetooth expose the internal system to external users as well as attackers [30]. Therefore, an approach for ensuring data integrity is needed. Finally, for data management, the obstacle of no standardized data format poses challenges for data extraction and interpretation, and can even leads to volatile data 9 2. Related work loss. In the following Chapter 3, we first present how we perform the literature review and the evaluation criteria. To address the before-mentioned gaps, we then introduce two relevant techniques: Blockchain and Cyber-investigation Analysis Standard Expression (CASE). For blockchain technology, since each block is linked by the hash value of its previous block, nodes can detect alteration of blocks, thus ensuring data integrity (I). Another feature of blockchain is that each node in the network keeps a copy of the main chain, and as long as more than half of the nodes are available and stay honest, the availability (A) and non-repudiation (N) properties can be guaranteed. However, ensuring "C" and "P" properties may require other approaches like encryption and data classification. CASE is an annotation language and has the ability to unify different data formats in a key-value pair format. 10 3 Methods Our approach is divided into four steps as follows. First, we performed a literature review that includes 36 papers and 2 databases. Second, we analyzed digital forensic approaches in similar areas. Third, we described and evaluated the techniques used in our solution. Forth, and finally, the evaluation criteria is discussed. 3.1 Literature review We followed the same approach as Strandberg et al. [2], where we first reviewed papers from different databases, and then perform Snowballing on these papers to gain additional papers. We first reviewed 33 papers from Google Scholar and IEEE Xplore. We used the following search strings, automotive digital forensics, IoT digital forensics, avionics digital forensics, railway digital forensics and smart cities digital forensics. By performing Backward and Forward Snowballing, we then obtained 3 additional papers closely related to our research. Most of these papers are technical solutions on ADF or similar areas, and some of them are surveys. They are later evaluated using Technology Readiness Levels (TRLs) at the end of this chapter. TRLs are a method for estimating the maturity of tech- nologies during the acquisition phase of a programme. It is based on a scale from 1 to 9, where 9 is the most mature technology, defined as follows [31]: • TRL1: Basic research • TRL2: Technology concept is formulated • TRL3: Experimental proof of concept • TRL4: Technology confirmed in lab • TRL5: Technology validated in relevant environment • TRL6: The technology demonstrated in relevant environments • TRL7: System prototype demonstrated in operational environment • TRL8: System complete and confirmed • TRL9: The system proven in an operational environment 11 3. Methods 3.2 Similar areas Digital forensics is not limited to automotive, there are other similar areas such as IoT, avionics, railway and smart cities, that have also taken digital forensics into consideration. In this section, we introduce similar areas and compare them to the automotive domain, with the aim to find their possible applicability for ADF. 3.2.1 IoT DF that is a research hotspot in IoT, also faces multiple challenges. IoT devices usually are connected and communicate with other entities. The devices are mostly heterogeneous, i.e., have diverse data sources, and results in various file systems and interfaces. Additionally, the devices can communicate using different communication protocols, such as HTTP, Bluetooth and NFC. Furthermore, the topology of IoT systems may be dynamic because the entities may move. All these characteristics make DF in IoT challenging. The IoT stores and transmits massive amounts of data between different devices. IoT DF can gather evidence from a variety of sources, such as sensors, communication devices, drones, smart home devices, cloud storage and vehicles. Thus, multiple DF approaches are required. The challenges of IoT DF include the increasing number of forensics entities, identify its relevance, blurry or non-existed network boundaries. As investigated in [32], although there are already many technical solutions that almost involve all aspects of IoT DF, it is still a field that needs hard work. Blockchain technology is considered suitable for IoT DF due to its immutable and distributed characteristics. As surveyed in [33], [34], [35] and [36], where a number of blockchain- based solutions have been presented. IoT relates to the automotive area not only because of the increasing use of IoT devices in vehicles [37], but also similar features between them such as being mo- bile and distributed. IoT-based DF analysis may lay a foundation for the forensic soundness and reliability of digital forensic processes in automotive systems [38]. 3.2.2 Avionics In avionics, there is a device called flight recorder, often referred to as the name of "black box". There are two types of flight recorder, the flight data recorder (FDR) and the cockpit voice recorder (CVR). The outer casing of the flight recorder is designed to withstand transient damage from harsh environments, is made of special material and is painted bright orange. It is installed in the safest position of the avionics. The flight recorder can record various parameters during the flight, such as flight time, speed, altitude, temperature, and even the dialogue between the pilot and the crew. In the event of an aircraft accident, finding the flight recorder and reading out the recorded data can help investigators performing digital forensic to determine the cause of the accident. 12 3. Methods Flight recorder technology in avionics is a mature field with a history dating back to 1950s. In recent years, there are some emerging techniques in avionics digital forensic also interested us, such as replacing the FDR and CVR with a system that provides aircraft data monitoring to support tracking of aircraft and data archiving technology from a space based platform in [39], and digital forensic on new generation aircraft [40]. With the development of technology, the flight recorder is no longer limited to be used in avionics, but can be widely used in variety of fields, such as missiles, rockets, trains, as well as within automotive digital forensics. An estimated price for a flight recorder is around $60,000, which is reasonable for an aircraft but expensive for a vehicle. So from a technological point of view, the flight recorder is suitable for automotive, but from an economic point of view it is not so ideal. 3.2.3 Railway Railway is considered as a closed safety system and runs in a non-networked environ- ment. Because of these characteristics of railway systems, there don’t seem to be many serious attacks to analyse from a DF perspective. As a result, DF analysis of the railway system has not received much attention. Like other fields, railway is on the road of digitization and become increasingly connected to the Internet. Thus, there is also a risk for cyber attacks on the railway systems, potentially with disastrous outcomes. The above indicates that DF in railway systems is imperative. Still, not much work has been done. J. Cosic et al. have published two papers outlining the chal- lenges and the investigation process in railway DF [41], [42], respectively. No other specific technical solutions have been presented. In addition, railway system is complicated and distinct, consists of multiple components with particular functions. Furthermore, railway is a centralized system with central control in comparison vehicles are distributed and communicate via V2X. As a result of existing work within the railway system, it seems to have very limited value for ADF. 3.2.4 Smart cities Smart cities is emerging and refers to a technologically modern urban area that uses different types of electronic methods and sensors to collect specific data [43]. The U.S. National Institute of Standards and Technology (NIST) proposed a model for smart cities that comprises six components: government, economy, mobility, environ- ment, living, and people [44] [45]. Baig et al. further divide smart cities into four categories: Smart Grids, Building Automation Systems (BAS), Unmanned Aerial Vehicles (UAVs) and Smart Vehicles, corresponding to environments, living, mobility and mobility. Smart grids gather and analyze the data about energy consumption pattern and provide a more flexible power supply. BAS controls the devices inside buildings and provide services like heating, ventilation and air conditioning. UAVs or drones have wide applications such as package delivery and coastline patrol. Future Smart vehicles will have more entertainment and diagnostic functionalities. 13 3. Methods Although smart cities have greatly improved life quality, there are vulnerabilities that can be exploited by attackers. For example, smart grids make use of thousands of smart meters to collect power usage data and upload them to the cloud for storage and analysis purpose, which opens the possibility of attacking. To mitigate the consequences of such events, DF has become an important topic in smart cities, and it has a lot in common with ADF. First of all, smart cities and vehicles have similar topology in the sense that everything is distributed and interconnected. Therefore, when performing forensics, investigators may acquire data from multiple devices instead of one particular device. Additionally, nodes status is changing, i.e., nodes are constantly joining or leaving the network, making it hard to determine the system’s state. Lastly, smart vehicles are within the scope of smart cities, thus DF for smart cities and ADF have similar characteristics. 3.3 Techniques In this section, we are going to introduce the specific techniques used in our solution. Blockchain is commonly referred to within the field of ADF and is highlighted as a promising approach due to its characteristics. CASE is a community-developed specification language that aims to advance the exchange of cyber-investigation information between tools and organizations [46]. We consider blockchain and CASE as the fundamental techniques in our solution. 3.3.1 Blockchain When people are shopping online, traditional payment system typically involves three parties: the buyer, the seller and a trusted third party (e.g., a bank). Although the third party solves trust issues, it introduces extra cost for both buyers and sellers. The emergence of blockchain technology is to ensure secure payments without a trusted third party. A blockchain is a distributed ledger with growing lists of records (blocks) that are securely linked together via cryptographic hashes [47], where each block contains several transactions information, and each node in the distributed system keeps a copy of the main chain. Due to providing traceability in such an approach, it is also ideal for storing forensic information. According to the bitcoin white paper [48], the blockchain system has three core components: a timestamp server, a proof-of-work system and an incentive system. It implements an implicit timestamp server by taking a hash of the previous block. The timestamp server proves that the data must have existed at the time in order to get into the hash [48]. When there are new transactions needed to be recorded, all nodes start working at the same time to find a new block. Thus, a proof-of-work system is necessary to determine which node finishes the work first. One of the approaches is to scan for a hash value that fulfills a particular pattern. The first node who has found a new block will receive a certain amount of currency as a reward. Such incentive system encourages nodes to keep working on finding next block. The blockchain technology can be applied to ADF. As mentioned above, establishing 14 3. Methods an event timeline is a crucial step in forensic investigations. The timestamp server plays an important role since it can record information chronologically, which ensures non-repudiation property. The proof-of-work system is helpful in proving that a node has done the work and thus ensuring authenticity. Every vehicle is a node in the blockchain network in our solution. For the incentive system, it motivates the nodes to keep storing forensic information since the sensors on vehicles produce data all the time. Another feature of blockchain is that it ensures data integrity. The whole system is not considered compromised as long as at least 50% of the nodes are honest because of its one-CPU-one-vote nature [48]. The process of computing hash is comparable to voting. In traditional one-IP-address-one-vote mode, an attacker can fake many IP addresses to have multiple votes. For one-CPU-one-vote mode, however, it is very unlikely that a limited number of individuals take control of over 50% of the nodes. Moreover, blockchain is based on distributed systems. All the running vehicles form a distributed system, where each vehicle runs as a separate node. Thus, blockchain technology is well suited for ADF. 3.3.2 Cyber-investigation Analysis Standard Expression Evidence data for ADF can be collected from various data sources, such as components in the vehicle, entities communicating over V2X, and the cloud. The heterogeneity of devices can lead to different data formats. Similar devices, for example, ECUs commonly runs different operating systems and uses different data formats [2]. Furthermore, data formats also vary across numerous vehicle brands and models. The lack of a standardized data format is a critical challenge for ADF. For example, there are two types of message on the CAN network. The standard one supports a length of 11 bits for the CAN identifier and the extended one supports a length of 29 bits for the CAN identifier, as shown in Fig 3.1 and Fig 3.2 [49]. LIN message frame consists of a header and a response as shown in Fig 3.3 [50]. Figure 3.1: Standard CAN message frame Figure 3.2: Extended CAN message frame 15 3. Methods Figure 3.3: LIN message frame There are solutions that proposes a common format. One typical example is the Navigation Data Standards (NDS). NDS is a standardized format for automotive navigation databases that aims to develop a standardized binary database format for the navigation data exchange between different systems [51]. NDS solves part of the problems, but a more general solution is still needed. From a forensic perspective, no matter where the data is extracted from or what protocol is used, a common and standardized data format is essential for the implementation of the ADF process. As a community-developed specification language, CASE utilizes JavaScript Object Notation for Linked Data (JSON-LD) to serialize forensic information. A complete CASE representation consists of multiple Objects, where each Object is a collection of key-value pairs. It provides users several types of objects such as Identities, Relation- ship, Action, Investigation, Roles, Traces, Location, Annotations and Tools. Objects are uniquely identified by a 128-bit label called Universally Unique Identifier (UUID). List 1 illustrates how CASE represents information. It contains a propertyBundle which defines it as a computer. The first "@type" implies that it is a "Trace" type of object. In the "propertyBundle", each "@type" and its associated information reveals the "Device" in more detail. This annotation gives a complete picture of the Device. During the forensic process, a large volume of data is usually gathered, analyzed and preserved from different devices and stakeholders. As previously highlighted, having a general approach to represent information is necessary. CASE is a common format intended for expressing and exchanging cyber-investigation information [46], and digital forensics is one of the specific interest domain for CASE. Therefore, we consider that CASE is suitable for ADF investigations due to its high flexibility, usability and semantics. 16 3. Methods 1 { 2 "@context": { 3 "@vocab": "http://case.example.org/core#", 4 "olo": "http://purl.org/ontology/olo/core#", 5 "acme": "http://custompb.acme.org/core#" 6 }, 7 "@graph": [ 8 { 9 "@id": "forensic_lab_computer1", 10 "@type": "Trace", 11 "location": "forensic_lab1", 12 "propertyBundle": [ 13 { 14 "@type": "Device", 15 "manufacturer": "Dell", 16 "model": "Inspiron 5000", 17 "serialNumber": "D1234567" 18 }, 19 { 20 "@type": "OperatingSystem", 21 "name": "Windows 7 Ultimate Edition", 22 "manufacturer": "Microsoft", 23 "version": "6.1.7601 Service Pack 1 Build 7601" 24 }, 25 { 26 "@type": "ComputerSpecifications", 27 "bios": "E1762IMS.10M", 28 "cpu": "Intel Pentium i7", 29 "ram": "4GB" 30 }, 31 { 32 "@type": "NetworkLocation", 33 "domain": "dfl.local", 34 "ipAddress": "192.168.1.145" 35 }, 36 { 37 "@type": "acme:InventoryComputer", 38 "name": "DFL-03", 39 "inventoryNumber": "10503" 40 } 41 ] 42 } 43 ] 44 } Listing 1: A Device represented using CASE 17 3. Methods 3.4 Evaluation Our intention is to analyse the usability of the CASE format aligned with blockchain technology. In order to do this, in the following subsections, we first use the TRLs to evaluate them and then carry out a gap analysis. 3.4.1 TRLs of technical solutions We go through the technologies presented in ADF and similar areas mentioned above and evaluate them using TRLs. The results are divided into five main categories based on the areas of Automotive, IoT, Avionics, Railway and Smart Cities. For each technical solution, we identify its readiness as shown in Table 3.1. In the Applicability Level (AL) column, denotes applicable, H# partially applicable, # not applicable. 3.4.2 Gaps analysis Although these techniques have solved many forensic problems, gaps still exist. For blockchain, one of the most important issues is timing and spacing overhead. The process of finding a new block involves computing hashes of a particular pattern, thus consuming considerable computation power and time. Similarly, it may require excessive storage space, because each node in the blockchain network is responsible for keeping a copy of the main chain. Therefore, timing and spacing overhead has become a main challenge for blockchain used in ADF. Moreover, an incentive/punishment mechanism is needed to ensure security [8], which is the motivation for vehicles in the networks to join and contribute to the blockchain. The blockchain technology described in [48] has an incentive system based on giving digital currency to par- ticipants. However, it does not work in ADF since there is no currency involved at all. In our case, it is an implicit incentive mechanism in the sense of "one for all, all for one". Each node records data for the benefit of all nodes. If the node itself is involved in an incident, it can also be quickly served. Thus, having an incentive mechanism can motivate vehicles to engage in forensic investigations. Using CASE to represent information benefits forensic investigations. Firstly, dif- ferent data being expressed in a unified format greatly reduces forensic complexity. For example, multi-jurisdiction is a common problem encountered in cross-border forensics. A single file can be even broken down into multiple blocks storing in different locations with different regulations [32]. In such case, a unified data format will improve cooperative investigation efficiency. Secondly, CASE supports various types of Objects. An investigation, a device, a file or even a location can be expressed by CASE. Such feature enables a detailed representation of information. Further- more, each object is identified by a unique UUID, making it easier to refer other objects. However, no method exist that automatically converts other data formats into CASE. 18 3. Methods Table 3.1: TRLs of technique solutions Area Ref. Technology TRLs AL1 2 3 4 5 6 7 8 9 Automotive [2] Survey [10] Hardware [8] Blockchain [9] Introduction [52] Framework [53] Data Extraction [13] Data Extraction [30] Framework [21] Framework [20] Review [19] Survey [5] Framework [29] Data Extraction [18] Data Analysis [3] Introduction [4] Data Analysis [54] Framework [7] Data Extraction [23] Software [22] Software [24] Framework IoT [32] Survey H# [38] Survey H# [37] Survey [36] Blockchain H# [34] Blockchain H# [33] Blockchain H# Avionics [39] System H# [40] System H# Railway [41] Framework # [42] Introduction # Smart Cities [44] Introduction # [45] Framework H# 19 3. Methods 20 4 Results In this chapter we describe the results in detail. The whole framework is introduced first with a table and a figure that illustrates the underlying mechanism, followed by each specific steps. At the end, a results document is designed to record all the relevant data and help the reader better understand how each procedure works. 4.1 Forensic lifecycle Forensic lifecycle refers to the procedures required for the completion of a forensic investigation. We propose six steps, namely 1) Data Collection (DC), 2) Data Preprocessing (DP), 3) Data Preservation (DV), 4) Data Retrieval (DR), 5) Data Analysis (DA) and 6) Documentation (DO). Below we give an overview of what should be done in each step. Any investigation starts with the collection of information that aids in solving a case. For ADF, data can exist in vehicle, be transferred V2X, or stored in various cloud sources. In the vehicle, sensors are vital components and can generate a wide range of data. For example, the GPS module records locations of the vehicle, which is of great importance for forensic investigations. Other data sources such as ECUs and EDRs may also contain relevant data. The task of step (1) is therefore to collect data from various data sources, whereafter the data is filtered to identify the most relevant data, which is later uploaded to the cloud. Step (2) involves data preprocessing, including data classification and format unification. Data are classified into several categories according to different rules and then unified in CASE format. This allows investigators to perform ADF within a particular category. In step (3), the reduced and well-organized data set is stored in the cloud as a chunk with a unique ID, which is a concatenation of the timestamp and the hash of Vehicle Identification Number (VIN). In this scenario, the chunk refers to the pieces of data stored in the cloud, and is used to distinguish it from the block stored in the vehicle. The vehicles will then receive the corresponding IDs from the cloud, and the IDs are later used for data retrieval. As we clarified previously, each vehicle is a node in the blockchain network and can propose or erase blocks. The ID information and the hash of the previous block form a new block that will be proposed to the blockchain. The architecture of the vehicle blockchain network is shown in Fig 4.1. During step (4), if after a period of time, there are no accidents involving a specific vehicle, the blocks and chunks containing its data will be erased because of storage 21 4. Results limitation. Otherwise, the data should be retrieved and verified for further actions, e.g., subsequent analysis in step (5). Finally, in step (6), all steps and analysis results are documented. Moreover, each step is further divided into several sub-steps. Table 4.1 summarizes the tasks of each step and Figure 4.2 visualizes the whole process in detail. Figure 4.1: Vehicle blockchain network architecture 4.2 Data collection The goal of this step is to collect all data pertaining to a specific vehicle and store them in the cloud. During the entire process, however, many factors can influence the success of data collection, such as stakeholders, data sources and tools. We mainly need to address two questions in Data collection phase: where to collect and how to collect. It comprises three sub-steps: Preparations, Feasibility analysis and Gathering. Firstly, identify the factors such as data sources, tools and interfaces in Preparations. Then determine if it is feasible to extract data in this context. Finally gathering all the data if feasible and upload to the cloud. We assume that the cloud is secure and cannot be compromised. 4.2.1 Preparations Preparations is the first sub-step of Data collection, where factors that may influence data collection are identified. Such factors include stakeholders, data sources, tools and interfaces. 22 4. Results Fi gu re 4. 2: A D F fra m ew or k 23 4. Results Table 4.1: ADF lifecycle Phase Description Data Collection (DC) Collect data from various data sources. Once the data collected, they are filtered to reduce the volume, and only the most relevant data are kept and uploaded to the cloud. Data Preprocessing (DP) Including data classification and format unification. Data are classified into several categories according to different strategies and then unified in CASE format. Data Preservation (DV) The data set is stored in the cloud as a chunk with a unique ID, which is a concatenation of the timestamp and the hash of VIN. The ID is then sent back to the corresponding vehicle and later will be used for data retrieval. The ID information and the hash of the previous block form a new block that will be proposed to the blockchain. Data Retrieval (DR) Two possibilities exist in DR. Blocks are erased to save space if no accidents occur after a period. Data will be read out from the cloud when accidents happen according to the chunk ID and verified for future use. Data Analysis (DA) The task of this step is to take the data as input, analyze them and generate an output (analysis result). It starts with identifying an approach, followed by analyzing the data using the approach. Then create a timeline for this accident based on the analysis result. Documentation (DO) All steps and analysis results are documented. Stakeholders Stakeholders are the intended audiences of the collected data, and their interests in different types of data can vary. For example, law enforcement agencies prioritize data that can be used as evidence in criminal investigations, such as GPS locations, video or audio recordings. Insurance companies, on the other hand, pay more attention to EDR data, which provides critical information for determining liability in accidents, such as brake, airbag and seatbelt status, as well as vehicle speed at the time of collision. EDR data can help insurers make informed decisions when compensating accident victims. For manufacturers, regular vehicle diagnoses are of great importance, as the diagnostic reports provide valuable insights for improving the vehicle’s performance and safety. On-vehicle service (e.g., navigation system) providers are primarily interested in user data to enhance user experience. Drivers themselves care more about dashboard data, such as fuel indicators, speedometers, and odometers, to monitor their vehicle’s performance. By identifying stakeholders and understanding their data needs, we can make better decisions about what data to collect and analyze, ultimately leading to better outcomes for all parties involved. Data sources The underlying mechanism of a vehicle can be summarized as follows: the ECUs continuously collect data from various sensors, analyze them, 24 4. Results and send instructions to actuators to control the vehicle. In case of a trigger event, such as a sudden change in speed, the EDR records the vehicle’s status prior to the event. Additionally, drivers can interact with the infotainment system to access valuable information from the internet. Throughout the entire process, sensors contain environmental data, while ECUs contain controlling data. EDRs are a source of safety-critical data, while user-generated data are stored in the infotainment system. Moreover, other vehicles, Roadside Units (RSU), the cloud, and mobile devices are all valid data sources due to V2X communication. We can significantly reduce complexity by identifying the appropriate data source based on the data type we require. Tools Data can reside in vehicle storage or internet. Prevalent forensic tools such as Forensic Toolkit (FTK), EnCase and autopsy extract the on-vehicle data by scanning the hard drive. Other tools like Wireshark can be used to monitor the data from internet. However, tools have limitations with respect to file systems. Various file systems exist in vehicle modules due to the differences in brands, OEMs and the models. For example, FAT, FAT32 and NTFS file systems in Windows embedded OS, QNX4 and QNX6 in QNX OS, HRFS and DosFS in VxWorks. QNX is considered as the most widely used file system in automotive industry for its helpful in building a safety, security and reliable automotive system, currently being used in more than 215 million vehicles, whereas most of the existing tools do not support the QNX file system. In general, forensic tools have their own applicable file systems, e.g., Table 4.2 from [4] shows the supported file systems of three forensic tools. Therefore, it is necessary to verify the suitability firstly and then choose an appropriate tool for data collection. Table 4.2: Tools and support file systems Support file system Encase 8.x Access forensics X-Ways 20.x FAT 12/16/32 NTFS EXT2/3/4 HFS+ UFS 1/2 QNX Interfaces As mentioned earlier, vehicles have multiple interfaces, such as OBD-II, JTAG, USB, WiFi, and Bluetooth. It is important to choose the appropriate inter- faces based on the specific needs. OBD-II is primarily used to access diagnostic data, while JTAG is used to access test data. If we want to acquire external data, WiFi or Bluetooth may be better options. For most scenarios, a combination usage of multiple interfaces will be concerned to meet the forensics requirements. One approach to identify factors is to focus on specific scenarios. For example, if a crash occurs on a BMW vehicle, we can identify the stakeholders, data sources, tools, and interfaces as follows. Firstly, the traffic police may investigate the cause 25 4. Results of the accident, and the insurance company may need to provide compensation. The driver, traffic police, and insurance company are all stakeholders. Secondly, the accident could have been caused by a variety of factors, such as driver’s misopera- tions, environmental factors like fog or slippery roads, or vehicle system failure. In this case, the EDR is the most important data source to consider, with the ECU being an auxiliary data source. Since the BMW vehicle uses the QNX operating system, the X-Ways Forensics tool is a good candidate because it supports QNX [55]. Additionally, the OBD-II interface may be useful in extracting data. Then we can move on to the next sub-step. 4.2.2 Feasibility analysis Feasibility analysis is an important step in the sense that we might abort the forensic process if it has too much time and economic costs. We will investigate this issue from three aspects: forensic difficulties, time cost and economic cost. As discussed in Chapter 2, connecting to OBD-II, connecting to ECU and physical chip disassembly are three common methods to extract data from vehicles. They differ in terms of forensic difficulties and costs. Extracting data via OBD-II port only requires a cable, a customized device and a customized software from the manufacturer. The OBD-II port is standardized and each pin has a distinct meaning, which enables fast and accurate diagnoses. Extracting data directly from ECU can be more challenging since it may require professional software to dump the entire file system from the ECU. An article [56] has compared three forensic software: Forensic Toolkit (FTK), EnCase and X-ways in price and performance, where performance is expressed by the time spending on searching a specific string, as shown in Table 4.3. For individual researchers, X-ways is a better choice. Table 4.3: Comparison of three forensic tools Cost Encase FTK X-ways Price (per year) $3500 $3000 $1000 Time for string searching 3m31s 3h56m3s 1m52s Physically chip disassembly also requires hardware experts besides a professional software. It has the highest costs but is the most comprehensive method of ex- tracting data. Table 4.4 summarizes the characteristics of each approach and their corresponding costs. 26 4. Results Table 4.4: Characteristics and costs of extraction methods Evaluation Connecting to OBD-II Connecting to ECU Physical chip disassembly Forensic difficulties Easy Middle Hard Time cost Low Depends on the tools High Economic cost Low High High Summarize Accurate Challenging Comprehensive 4.2.3 Gathering In the Preparations sub-step we have addressed the question of where to collect the data by identifying the data sources, and in this Gathering sub-step, we aim to address the question of how to gather the data. Various data sources make data gathering challenging, not only because of the large number of data sources related to large number of devices and entities, but also because of the heterogeneity of these entities. Data should be gathered and uploaded to the cloud in a forensically sound manner to ensure the integrity and confidentiality. To meet the security properties, the cloud is managed by a trusted third party, and the data are updated and stored the data in a cryptographic way. In general, there are several levels of data extraction strategies in ADF: Network Level, Board Level and Chip Level [53]. Due to its high usability and ability to be performed automatically without damaging vehicle modules, Network Level extrac- tion is the most commonly used method and is usually sufficient in most scenarios. As our approach is based on blockchain implemented in a distributed V2X network, we primarily focus on Network Level data extraction. This method uses specific manufacturer software to gather data from the multiple modules of the IVN. The ECUs continuously collect data from various sensors, and the infotainment system stores the user-vehicle interaction information. This data are huge in volume and needs to be filtered first and then uploaded to the cloud. After filtering, the data that is not relevant to forensics is filtered out. This reduces the volume of the data, resulting in lower storage costs and higher performance. The gathered data stored in the cloud is preprocessed in the next sub-step. To implement data gathering, we can utilize software to listen on CAN buses and other communication channels. 4.3 Data preprocessing Finding valuable information from the raw data is a rather difficult task because they are uncategorized and chaotic. Unifying data formats and reducing data complexity have become essential steps. We tackle this problem with two sub-steps: Classification and Format unification, which address complexity and data formats issues respectively. 27 4. Results 4.3.1 Classification Data classification is a process that divides a data set into different categories based on a standard. In our work, three classification standards are identified: data sources, confidential levels, and the contents, as shown in Figure 4.3. 1) Firstly, data can originate from IVN or V2X. However, this data source based method is imprecise and requires further sub-categorization. In IVN, sensors, EDR and infotainment are all possible data sources. V2X contains sub-categories like V2D (Vehicle-to-Device), V2G (Vehicle-to-Grid) and V2N (Vehicle-to-Network). 2) Confidentiality level is another standard for data classification. There are four levels in general: Restricted, Confidential, Internal and Public. This classification method helps the investigators determine the sensitivity level of data. The data with the highest confidentiality level play the most important role, and special attention should be paid to the protection of such data. 3) Since the obtained data are all forensics-related data after filtering, the data can be classified into three categories in terms of contents: safety-related data, security-related data and personal data. Safety-related data primarily involves safety incidents caused by, for example, hardware or software failure or driver’s misoperations. The data falling in this category include the accelerator, the status of the safety belt and the speed. Security-related data are the information related to the security events or modules of the vehicle, e.g., the malicious code implanted in the vehicle, the remote manipulation of the vehicle, or the vehicle’s Intrusion Detection System (IDS). Personal data are the user data that involves the interaction between the stakeholders and the vehicle, such as the data originated from the smartphone. A supervised machine-learning approach is a good option to perform data classifi- cation. Supervised machine learning uses labeled datasets to train algorithms to predict results. In the context of ADF, this approach works as follows. 1) Prepare the dataset. The filtered data from the previous sub-step is the dataset in this phase. 2) Split the dataset into training and testing sets, for example, 10% training and 90% testing set using train_test_split() function in sklearn. The training set is annotated with the labels of safety, security, and personal, corresponding to the three categories based on the contents. 3) Train the model using several classification algorithms such as Random Forest, Support Vector Machines, K-Nearest Neighbours and Neural Networks. These algorithms will result in different accuracy, and the one with the highest accuracy will be selected. 4) Predict the results using the fitted model with the sklearn’s predict() function, and the result is the classification with the three categories, i.e., safety, security, and personal. 28 4. Results Figure 4.3: Data classify strategies After data classification, the data are presented in a clearer, more structured way and is easier to access, giving the investigators a better understanding of the data. It is also easier to identify the most relevant data about a specific incident. For instance, if an incident is caused by an attacker remotely controlling the car, more attention should be paid to the ECU category data. Therefore, a proper data classification can greatly improve efficiency of forensics. In addition, classification can protect the confidential and sensitive data, which makes the data more secure. The filtered and classified data will then be unified in CASE format in the next sub-step. 4.3.2 Format unification The last sub-step aims at standardizing the data formats. Due to the variety and quantity of on-vehicle devices, many problems can arise if there lacks an exchange- able data format. Firstly, data interpretation has become an intractable issue. For instance, service providers and manufacturers have their own data formats, and some insurance companies require drivers to carry a plug-in device that records their driving behavior [13], which has introduced additional data to be dealt with. In order to interpret the diverse data formats, specialized software are needed, leading to a slower forensic process. On the other hand, modern forensics is usually not ac- complished by a single person or authority, but a collaborative activity. For example, cross-border investigation refers to the investigation that involves multiple parties 29 4. Results (e.g., countries, jurisdictions). A unified data format is of particular importance here, as files may need to be reconstructed from multiple pieces held by different parties. In our work, CASE is applied to achieve this goal. Object and property bundle are two fundamental components of CASE, where object depicts an entity and property bundle contains the corresponding properties. Table 4.5 lists all types of objects and their descriptions [46]. Table 4.5: CASE objects and descriptions Object Description Investigation An exploration of the facts involved in a cyber-relevant set of suspicious activities PropertyBundle A group of properties characterizing a particular aspect of an object Identity Characterization of the identifying properties of an individual or organization Location A geophysical place, site or position Tool Characteristics of a tool used in a cyber context Relationship An association or link between two objects Annotation A statement asserted to be true in relation to one or more other objects Action Something that may be done or performed within the digital domain Trace A distinct article or unit within the digital domain ProvenanceRecord A provenantial connection between a forensic action and a set of observations A simple example is presented to show the usage of CASE. Suppose researchers obtain the following data in an investigation of a car crash: On May 6, 2020, at 14:32 pm, Bob was driving a BMW X1 from Stockholm to Oslo. However, at 16:05 pm, the car crashed in Örebro at the geographic coordinates of 59◦N, 15◦E. The video captured by a Garmin dash camera revealed that the road conditions and visibility were good at the time of the crash. A recent maintenance report indicated that all functions of the car were normal. However, the EDR data showed the car was traveling at a speed of 120 km/h and in 5th gear prior to the crash. Based on this information, a preliminary conclusion has been made that the accident was caused by speeding. In this scenario, various types of data, devices, stakeholders and judgements are involved, but they can easily be unified with CASE as follows: 1 { 2 "@id": "investigation-42dec83a-ec19-11ed-a05b-0242ac120003", 3 "@type": "Investigation", 4 "name": "Accident X", 5 "focus": "Crash", 6 "description": "In Örebro, Bob's BMW car crashed", 7 "object": ["Bob-uuid","car-uuid","dashcam-uuid", 8 "maintenance-report-uuid","EDR-uuid","location-uuid", 9 "forensic-action1-uuid","annotation1-uuid"] 30 4. Results 10 }, 11 { 12 "@id": "Bob-uuid", 13 "@type": "Identity", 14 "propertyBundle": [ 15 { 16 "@type": "SimpleName", 17 "firstName": "Bob", 18 "lastName": "Smith" 19 }, 20 { 21 "@type": "BirthInformation", 22 "birthdate": "2000-01-01" 23 } 24 ] 25 }, 26 { 27 "@id": "car-uuid", 28 "@type": "Trace", 29 "propertyBundle": [ 30 { 31 "@type": "Car", 32 "brand": "BMW", 33 "model": "X1" 34 } 35 ] 36 }, 37 { 38 "@id": "dashcam-uuid", 39 "@type": "Trace", 40 "description": "good road condition and visibility", 41 "propertyBundle": [ 42 { 43 "@type": "Device", 44 "brand": "Garmin", 45 } 46 ] 47 }, 48 { 49 "@id": "maintenance-report-uuid", 50 "@type": "Trace", 51 "propertyBundle": [ 52 { 53 "@type": "File", 54 "filePath": "\Bob\home", 55 "fileName": "maintenance-report.pdf" 31 4. Results 56 } 57 ] 58 }, 59 { 60 "@id": "EDR-uuid", 61 "@type": "Trace", 62 "propertyBundle": [ 63 { 64 "@type": "EDR", 65 "data": { 66 "speed": "120km/h", 67 "isBraked": "true", 68 "gear": "5" 69 } 70 } 71 ] 72 }, 73 { 74 "@id": "location-uuid", 75 "@type": "Location", 76 "propertyBundle": [ 77 { 78 "@type": "SimpleAddress", 79 "locality": "Örebro", 80 "region": "Sweden", 81 "postalCode": "70210" 82 }, 83 { 84 "@type": "LatLongCoordinates", 85 "latitude": "59.293257", 86 "longitude": "15.199069" 87 } 88 ] 89 }, 90 { 91 "@id": "forensic-action1-uuid", 92 "@type": "ForensicAction", 93 "name": "extracted", 94 "startTime": "2020-05-06T15:36:20Z", 95 "endTime": "2020-05-08T09:30:48Z", 96 "propertyBundle": [ 97 { 98 "@type": "ActionReferences", 99 "location": "location-uuid", 100 "target": "Bob-uuid", 101 "object": ["car-uuid","dashcam-uuid", 32 4. Results 102 "maintenance-report-uuid","EDR-uuid"], 103 "result": "annotation-uuid" 104 } 105 ] 106 }, 107 { 108 "@id": "annotation-uuid", 109 "@type": "Annotation", 110 "tag": ["forensic"], 111 "description": "The accident is likely caused by speeding", 112 "object": ["forensic-action1-uuid","EDR-uuid"] 113 } Listing 2: CASE representation of the scenario There is an overall Investigation object that refers eight sub-objects, where each sub-object stores some information. Specially, an object representing the forensic action and another representing the preliminary forensic result are shown. Referring other objects is accomplished by the unique UUID. Up to now, the collected data have been filtered locally resulting in a re- duced data set. The data are then uploaded to the cloud for classification and formatting, after which they will be preserved in the cloud as a chunk for further steps. 4.4 Data preservation A reduced and classified data set with a unified format is available from the previous steps. One approach for preserving the data is to have a dedicated device on vehicle. In our approach, however, they will be preserved in a permissioned blockchain, which provides more privacy as it is only accessible to members who have been granted permissions by the administrator. Nodes are permitted to perform certain actions by presenting certificates. A permissioned blockchain has the benefits of a blockchain, as well as the authority aspect of a centralized system. Nodes in the network are identified, rather than being anonymous in a permissionless blockchain. Since it is not publicly accessible, the contents of blocks are more transparent in the network among the members. In addition, permissioned blockchain is faster because of the partially decentralized structure and fewer nodes, resulting in better performance. Moreover, an important step in traditional blockchain is finding a proof-of-work, such as a hash with particular pattern. However, the real-time requirement of ADF make it not applicable. In our work, each vehicle proposes its own blocks and broadcasts. Table 4.6 summarizes the differences between permissioned and permissionless blockchain. In previous steps, data have been filtered locally and uploaded to the cloud for classification and formatting, and then stored as a chunk in the cloud. In this step, the cloud generates a unique ID for each data chunk by concatenating the timestamp and the hash of the VIN. This approach offers several benefits. Firstly, the ID is 33 4. Results Table 4.6: Comparison of permissioned and permissionless blockchain Property Permissioned Permissionless Privacy Accessed by members Open Decentralization Partially decentalized Totally centralized Anonymity Identified Anonymous Consensus Mechanism PBFT, FBA, RRC PoW, PoS, PoC Speed Faster Slower Performance Higher Lower guaranteed to be unique even if a vehicle generates multiple blocks with identical VINs, due to the uniqueness of the timestamp. Secondly, investigators can search for blocks within a specific time period by indexing the timestamps. Furthermore, the VIN is private information for vehicle owners, and hashing is an effective way to protect it. After the ID being generated, the cloud returns it back to the vehicle. The vehicle then proposes a new block containing the hash of the previous block and the received ID. Blocks are linked through hashes, and there is a one-to-one correspondence between a data chunk in the cloud and a block in the vehicle. The ID is similar to a pointer in terms of functionalities, i.e., they are both addresses and point to a location where the actual contents are stored in. It is then broadcast to all nodes over the network so that each node creates a copy of this block. Finally, if the hash is checked to be correct, the nodes accept the block by appending it to the top of the chain, reaching a consensus and continue waiting for new blocks. In case that the vehicle is disconnected, we have a buffer mechanism that can store the generated block temporarily. When it is online again, the blocks are sent out. Figure 4.4 illustrates the procedures of data preservation. 4.5 Data retrieval Two possibilities exist in this step. Blocks and chunks are erased to save space and reduce the load of the system if no accidents occur after a period of time (e.g., one month). Otherwise, the data will be read out from the cloud according to the chunk ID stored in the blockchain. 4.5.1 Block erasing Due to limited storage capacity of both the vehicle and the cloud, it is infeasible to preserve all blocks or chunks without deleting old ones. To estimate the required storage space, we conduct a simple calculation on blocks as follows. In traditional blockchain, a new block is generated every ten minutes, and we can apply this to our work as well. To further reduce the data stored in each vehicle, the V2X network can be divided into sub-networks that consist of a certain number of nodes , for example, 100000 nodes per sub-network. In this case, vehicles 34 4. Results Figure 4.4: Procedures of data preservation only need to store the blocks generated by those in the same sub-network, which requires less on-vehicle storage space. Suppose we use SHA-256 as hash algorithm and a 32-bit timestamp, the size of a single block is the sum of a 256-bit hash of the previous block, a 32-bit timestamp and another 256-bit hash of the VIN, i.e., 256 + 32 + 256 = 544bit/block. The average daily running time of a vehicle is assumed to be ten hours. The total space required for a sub-network with 100000 nodes in one month is 544/8 ∗ 10 ∗ 60/10 ∗ 30 ∗ 100000/10243 ≈ 11.4GB/month, which is not an issue for most of modern vehicles as they typically have much larger storage space. Therefore, it is reasonable to erase blocks older than one month. Erasing old blocks is implemented by removing them from the main chain. Additionally, the hash field of next block should also be reset to zero, indicating the initial block of the chain. 4.5.2 Block accessing When accidents occur, investigators acquire the relevant data by accessing the corresponding blocks. Specifically, estimate the time of the accident, create a range of timestamps and index the blocks using these timestamps, with an additional step of checking the hashes to be correct. Since multiple vehicles may generate blocks within this range, we select the necessary ones based on the VIN hash. Then retrieve the data chunks from the cloud using IDs. In next step, the data contained in the chunks will be analyzed. 35 4. Results 4.6 Data analysis Forensic data analysis examines the structured data with regard to criminal incidents, aims to reveal the truth of the incidents. Specifically for ADF, the task of this step is to take the chunks of data that have been processed in the previous steps as input, analyze them, and produce an output, i.e., the analysis result. It starts with narrowing down the scope of the analysis data, followed by analyzing the specific data, and finally establishing a timeline for this incident and reconstructing the incident based on the analysis result. It is important that the results are reproducible. This is because the results should be the same in or- der to provide reliable evidence no matter how many times the DA process is repeated. During the occurrence of incidents, there are expected some anomalies such as the vehicle over speeding, running red lights, crashing or being involved in hit-and-run incidents. Numerous data are stored in the cloud, but not all of them are related to the specific incident. Firstly, we narrow down the data scope based on the time when the event took place, which can be an exact time or a time range. We then retrieve the data chunks by the ID associated with the exact time or time range. Finally, all necessary chunks are presented to the investigators. Here we borrow the naming method of the terminologies in interaction provenance [52], and define: 1) Actor, the subjects interacting with each other in the IVN or V2X, such as the driver, the vehicle, the pedestrian, the signal light and the RSU. 2) Interaction, is a CASE file containing a description of a message or an action exchanged between the Actors. 3) Event: is a set of Interactions triggered by the Actors to perform an operation and, 4) Story: is a list of Events con- sisting of ordered Interactions. 5) Incident: a series of Stories ordered chronologically. Different incidents are always closely related to specific vehicle modules, which motivates investigators to focus on analyzing data of specific components. For example, analyzing the state of acceleration, braking, and steering is of great importance when in a crash. In this scenario, the forensic information is dug up by the path in the chunks as follows, category, subcategory of specific components, finally to the exact CASE files. When we get in touch with the CASE files, for its unified and forensic manner designed format, the information of the time, the location, the speed, the operation of the driver, the VIN of the vehicle, the messages and the actions between them are presented in a clear way, that is, the Interactions. The CASE files in the same subcategory of component form an Event of this component interacts with other actors. Whereafter the related Events combined together, the Story is presented. When all the Stories have been created, the timeline is established and the reconstruction of the Incident is complete. The entire procedure of data analysis is shown in Figure 4.5. 36 4. Results Fi gu re 4. 5: D at a an al ys is 37 4. Results 4.7 Documentation This is the last step of forensic lifecycle. Previous procedures and results are documented in this step, such as timestamps, tools, data sources and data analysis results. It is worth noting that only facts can be recorded. Any judgements, deductions or hypotheses should be excluded from the final report, which is later presented on the court. 4.7.1 Results collection Each of the previous steps takes inputs and generates outputs. Our task is to record all the outputs in a comprehensive and unified way. Figure 4.6 is an example of the results document that used to record all the relevant data of one investigation. The investigators are responsible for filling the form after completing each step. In the results document, the Case ID field, which is the serial number of the documents, is created first. Then the five W elements are recorded, namely Who (Investigator) and Whom (Driver), When (Date and the Timestamp), Where (Location), What (Type) and How (Step and Tools). Specifically, the Investigator is the subject, responsible for performing ADF and filling this document, while the Driver and Vehicle are the objects, including the driver’s personal information, the vehicle Model and VIN. Type indicates the type of incident, such as a crash, a hardware failure or a cyber attack. Details of the forensic process are also recorded, including forensic sub-steps, Block IDs along with their Validity, Chunk Size, Tools used in this step, and the Data Sources. Finally, a Description field records the entire life cycle of the incident in detail. This results document records the entire ADF procedures step-by-step and helps both the DA and DO steps to be repeatable and the results to be reproducible. 4.7.2 Reporting During previous steps, the timeline has been established, the accidents have been reconstructed and a number of results documents are available. In this final step, investigators report these findings on the court, marking the end of the entire forensic lifecycle. 38 4. Results Figure 4.6: Results document form 39 4. Results 40 5 Discussion In this chapter, we discuss our approach, with special attention on highlighting innovations and limitations. We then identify the ethical issues. In addition, we analyze the capability of the adversary in the attacker model. 5.1 Innovation As mentioned before, ensuring data security properties and unifying forensic data formats are identified as challenges of ADF. We addressed these problems by utilizing CASE and blockchain. More specifically, we summarized the following innovative aspects of our work: 1) Other solutions pay more attention to usability instead of privacy. In our solution, however, privacy is an important concern that should be well protected. We implemented it on the cloud by classifying data into several categories by different classification strategies, such as safety-related data, security-related data and personal data in terms of content. When performing ADF, the in- vestigators only need to access necessary data in corresponding category and subcategory. For example, in case of a car crash accident, safety-related data is of more interest than personal data, thus personal data related to privacy will not be disclosed. Classifying data is implemented by machine learning algorithms, which requires tremendous computing powers, we thus deploy it on the cloud. Another measure concerning privacy is that each block stores the hash of VIN, as it can uniquely identify a vehicle. Generally, knowing the hash of a string, one cannot compute the original text reversely. Therefore, due to this one-way characteristic of hash function, the VIN information can be protected. 2) Unifying forensic data formats is a problem for not only ADF, but all branches of digital forensics. Vehicle manufacturers, service providers and software developers have their own data formats and can lead to information being out of synchronization. CASE solves this problem by converting other data formats into JSON-LD, which makes data management and interpretation easier. With a unified data format, only one interpreter is needed, resulting in reduced complexity and improved interoperability. 3) The blockchain technology is originally designed as a distributed ledger, with the ability of ensuring data security properties by employing hash algorithms. 41 5. Discussion We adopted this concept to guarantee data integrity, availability and non-repudiation in ADF. Since blocks are linked through hash values, any attempt to modify one block can be detected by examining its hash, thus ensuring data integrity. Furthermore, non-repudiation is implied by the fact that a block mush have existed at the time to get into the chain. Due to duplication of the main chain in each node, data are always available as long as more than half of nodes stay honest. For confidentiality, we assume the cloud to be secure and will not be compromised. As a result, the data preserved on the cloud can also be considered confidential. However, ensuring the security of the cloud itself is out of our work’s scope. 4) A vehicle can produce a huge volume of data every day, it is infeasible to store or transmit them all. We have applied three methods to reduce data size. Firstly, there is a sub-step of local data filtering that aims at extracting the forensic-relevant data. Secondly, the actual data are stored on the cloud, which can be considered having unlimited storage space, while the ID is transmitted between the cloud and vehicle. ID is typically a 288-bit number and can easily be sent via internet. In addition, unnecessary blocks are erased after some time. All these measures can reduce data volume to a great extent. 5) The ID field consists of a timestamp and a hash of VIN. The times- tamp allows investigators to search for certain chunks, as well as recording the time of accident. The VIN field distinguishes the vehicle since it is unique around the world. Therefore, the concatenation of these two values is able to uniquely determine the data chunk of a desired vehicle and time. 5.2 Limitation We used multiple strategies to reduce the data volume and the complexity of the data structure, making it a lightweight solution. However, potential limitations still exist. 1) Storage limitation As assumed in subsection 4.5.1, for a sub-network of 100000 nodes, each running for 10 hours per day, the total space needed in a vehicle is approximately 11.4GB per month. On the other hand, the preprocessed data are all stored as chunks in the cloud. According to an estimation, a vehicle produces 25GB of data per hour, most of which are HD video and music streaming, or generated by web browsing [57]. Assuming 1/10000 of them are valuable for forensics, then 75TB of data are uploaded to the cloud per month in a sub-network. This is an ideal assumption. In a real scenario, the size of the sub-network may be larger than 100000 nodes, for example in the super cities such as New York. In addition, a certain number of vehicles are used for business, such as transportation vehicles and taxis, which usually run for more than 10 hours a day. 42 5. Discussion And the proportion of valuable data would be more than 1/10000. Furthermore, we unified the data in CASE format, which can add additional information, for example, the predefined fields in the format that form a CASE framework, such as the "@id", "@type", "description". The numerous number of such fields significantly increase the data volume. As a result, the actual storage space required for the vehicle or cloud might be slightly larger than assumed, but in an acceptable range. Vehicle storage space usually depends on the model and the specific configuration. In the state-of-the-art modern vehicles, there are larger storage spaces, and then it is not a critical issue, but the storage limitation is still worth considering. 2) Computation and time consumption Machine learning algorithms usually imply massive computational requirements. For example, in Convolutional Neural Networks (CNNs), which are useful and effective for classifying data, the training speed and prediction time of the models depend on the time complexity. The time complexity is the overall number of computations that the algorithms perform, and it depends on the computation times at each layer and the layer numbers of the CNNs models. In our framework, machine learning based approaches are widely used during data pre-processing sub-steps, i.e., data filtering and classification. Besides the algorithms, data format standardization can increase time complexity as well. Currently, there lacks an automated approach to convert data into CASE format and the conversion is tentatively done by human, which consumes a huge amount of time. In summary, our approach, with its innovations and advantages, is suitable for most common scenarios. Nevertheless, in extreme cases there are still concerns about the storage, computation and time consumption, which may be limitations of our approach. 5.3 Ethics Due to the immaturity of research in the field of ADF, there are ethical issues relating to both individuals and organizations. 1) Privacy of the stakeholders We mentioned previously that there are privacy concerns for ADF. To be more specific, when collecting accident-related data, we may unintentionally touch upon stakeholders’ private information, whether sensitive or not. For example, if the camera has taken a photo of an innocent pedestrian, then his 43 5. Discussion privacy is violated. Similarly, if the sound recorder has recorded irrelevant information about the driver’s private life, this may cause great inconvenience or trouble to the driver. With GPS, we also have to consider the problem that whether it is legal to share someone’s location. In particular, the state-of-the-art vehicles are usually equipped with a multi-function infotainment system, which increases the risk of revealing the privacy of those involved, not only the driver, but also the passengers and the pedestrian. GDPR is a comprehensive data privacy law that all industries and orga- nizations in the European region must comply with. For ADF, GDPR provides a legal guidance on the collection, processing, storage and transfer of personal data, meaning that all personal data must be handled in a secure manner to protect the privacy of the stakeholders. 2) Ethics of AI algorithms As machine-learning algorithms are used in our approach, the ethical is- sues of AI cannot be ignored. If AI algorithms are misused in criminal identification, it will inevitably compromise people’s rights. For example, attackers can imitate others biological characteristics like voice using AI algorithms. There can also be deviations by using different algorithms and data sets. An algorithm is in essence a process of using past data to predict the future, where the outcomes are determined by algorithms and the input data together. Therefore, these two factors become the main sources of deviation. In addition, the availability and accuracy of the input data can also affect the accuracy of the prediction. If there are malicious nodes existing in the networks, the data provided by them can affect the results of ADF. There are also human rights concerns arising from the ethics of algorithms. Algorithms are opinions expressed mathematically or in computer code. They are subjectively designed and chosen by the designers and the developers based on their own judgement [58]. In the ADF scenario, algorithms can help identify the criminal. There is a risk that just because someone has been a criminal once, they will always be a criminal in the algorithms. In this case, an innocent person can become a victim due to the mistakes of the algorithms. 5.4 Attacker model Here we will hypothesise attacks from adversaries and evaluate the impact on the system, the cloud and the vehicles respectively. In the attacker model, we state the potential purpose of the attackers, analyze their capabilities, and argue about how our solution is able to mitigate the threats. 44 5. Discussion Vehicles are in an insecure environment and there are various threat actors sabotaging them constantly, such as a person trying to alter digital evidence to avoid prosecution. Attackers can get access to forensic data from the cloud or from the vehicle. The cloud is connected to the vehicles via the internet, which opens the possibility of attacking. Therefore, a common attack vector is to compromise cloud servers to retrieve, manipulate or delete digital evidence. As the cloud stores forensic data with highest confidentiality level, it should not be compromised under any circumstances, which requires strict security measures. We assume that the cloud has applied state-of-the-art security measures to minimize potential threats. In addition, our assumption also includes a backup mechanism for cloud storage to keep a copy of the forensic data. For vehicles, the attack vectors are mostly passive attacks aimed at eaves- dropping. Hardware and software could be controlled or compromised depending on the access to the IVN and V2X communication, either wired or wireless. So there are several attack vectors, such as hacking through the OBD-II port to eavesdrop on traffic between ECUs and communication buses, or remote hacking of the infotainment systems to compromise personal information, or even physically destroy vehicle components. Thus, we cannot protect the actual data itself. However, it can be considered infeasible for the adversary to carry out undetected attacks to tamper the IDs stored in the chain with the nature of use of the blockchain. Our solution can mitigate the threats on vehicles since the blockchain net- work is resilient to data tampering. It is very unlikely for any individual attacker to control more than half of global vehicles. Therefore, as long as at least 51% nodes stay honest we can guarantee the digital evidences are reliable. Nevertheless, the security of cloud servers is not in our scope and left to be the future work. 45 5. Discussion 46 6 Conclusion In this section we identify what we have done, summarize and evaluate our work. We also identify what efforts can be made in future work based on our approach. 6.1 Conclusion Currently, there are many solutions for digital forensics, but few of them regard ADF. Forensics on vehicles is also an important topic for researching since there are over 5M car accidents every year [59]. Fast and accurately determining the causes of accidents has become an essential requirement. Some scholars proposed macro-level frameworks for ADF, while others presented solutions targeting specific problems, but they all have limitations. Lacking of standardized forensic data formats and difficulties in ensuring data security properties are two main limitations. To address them, we proposed an innovative solution utilizing CASE and blockchain. In this thesis, we started by introducing the background of ADF and re- lated concepts, identifying challenges and describing purposes, giving readers an overview of what is lacking on existing solutions. Follow that we introduced related works from four aspects, which showed what efforts have been done. Then we discussed details of literature review and similar areas, and borrowed some ideas from others’ work, e.g., utilizing blockchain to protect d