Decentralized Replication Using CRDTs In Systems with Limited Connection and Bandwidth Master’s thesis in Computer Systems and Networks Fredrik Johansson & André Perzon Department of Computer Science and Engineering CHALMERS UNIVERSITY OF TECHNOLOGY Gothenburg, Sweden 2019 Master’s thesis 2019 Decentralized Replication Using CRDTs In Systems with Limited Connection and Bandwidth FREDRIK JOHANSSON & ANDRÉ PERZON Department of Computer Science and Engineering Chalmers University of Technology Gothenburg, Sweden 2019 Decentralized Replication Using CRDTs In Systems with Limited Connection and Bandwidth Fredrik Johansson André Perzon © FREDRIK JOHANSSON & ANDRÉ PERZON, 2019. Supervisor: Philippas Tsigas, CSE, Chalmers University of Technology Advisor: Erik Pihl & Fredrik Thune, CPAC Systems AB Examiner: Marina Papatriantafilou, CSE, Chalmers University of Technology Master’s Thesis 2019 Department of Computer Science and Engineering Chalmers University of Technology SE-412 96 Gothenburg Telephone +46 31 772 1000 Cover: Vehicles synchronizing their databases Typeset in LATEX Gothenburg, Sweden 2018 iv Decentralized Replication Using CRDTs In Systems with Limited Connection and Bandwidth FREDRIK JOHANSSON & ANDRÉ PERZON Department of Computer Science and Engineering Chalmers University of Technology Abstract Distributed storage systems use replication to deal with problems such as fault tol- erance, availability of data, lowering latencies and scalability. However, a prominent problem with distributed systems is how to keep replicas consistent during partition- ing where ad-hoc solutions have proven to be error prone. For applications that do not require strict consistency, Conflict-Free Replicated Data Types (CRDTs) can be used. CRDTs provide a theoretically sound approach to replicating data under the eventual consistency model. Strict consistency is traded for improved availability, better response time and support for disconnected operations. Simple mathematical properties are used to solve inconsistencies between replicas that may occur due to the eventual consistency. The State-Based CRDT provides full availability, is commutative, associative and idempotent making it suitable for applications on unreliable networks. The State- CRDT sends its whole state straining the network as the state increase. To deal with the weakness of the State-CRDT, Delta-CRDT was proposed. The Delta-CRDT has the same properties as the State-CRDT but splits states into deltas, limiting the load on the network. The Delta-CRDT is more complex to implement than State-CRDT and is more vulnerable to message loss and requires more communication overhead. This research aims to examine the viability of replicating data by applying CRDTs in the same environment as an existing centralized system. The questions answered in this project are: Can CRDTs be used in a real-life environment with large data sizes and a low number of updates on a low bandwidth network with connection losses? Which type of CRDTs are suitable for this environment? How does the performance of the CRDT systems compare to the existing system? What are the trade offs between the different types of systems? The metrics used to evaluate the systems are operation latency, message latency, time to reach consistency, bytes sent and tolerance to message loss by looking at implemented quality arrays. 
Testing was conducted on a virtual network with data collected from a real-life scenario in which vehicles operating in a mine synchronize their databases through a centralized system on an unreliable network with low bandwidth. The results show that the State-Based CRDT system is not viable due to flooding the network with too many and too large messages. The Delta-State CRDT system perform better than the existing system in most metrics and is a viable replacement for the existing system if availability is prioritized. Keywords: CRDT, Distributed Systems, Centralized, Decentralized, Replication, Strong Eventual Consistency v Acknowledgements We would like to acknowledge CPAC Systems AB for their support, allowing us to use their facilities, providing us with the data that made it possible to accomplish this thesis and treating us as a part of the group from the very start. A special thanks to Fredrik Thune and Erik Pihl for their time and supervision. We would also like to thank our supervisor at Chalmers, Philippas Tsigas for his valuable guidance and feedback throughout the thesis. Fredrik Johansson & André Perzon Gothenburg, June 2019 vii Contents List of Figures xi List of Tables xii List of Source codes xiii 1 Introduction 1 1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.3 Aim . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.4 Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.5 Performance Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . 3 1.6 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2 State of the Art 5 2.1 Replication in Distributed Systems . . . . . . . . . . . . . . . . . . . 5 2.2 Replication with Centralized Management . . . . . . . . . . . . . . . 6 2.3 Replication with Distributed Management . . . . . . . . . . . . . . . 7 2.4 Consistency, Availability, Partitioning; The CAP Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.5 Consistency Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.6 Conflict-Free Replicated Data Types . . . . . . . . . . . . . . . . . . 10 2.6.1 State-CRDT . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.6.2 Operation-Based CRDT . . . . . . . . . . . . . . . . . . . . . 12 2.6.3 State-CRDTs with Delta-mutators . . . . . . . . . . . . . . . 13 3 Design and Implementation 15 3.1 Centrally Managed Replication Without CRDTs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 3.2 CRDT System Variations . . . . . . . . . . . . . . . . . . . . . . . . 16 3.2.1 Software Design Layout . . . . . . . . . . . . . . . . . . . . . 17 3.2.2 Merge Operation . . . . . . . . . . . . . . . . . . . . . . . . . 18 3.2.3 State-based CRDT - State Machine . . . . . . . . . . . . . . . 19 3.2.4 Delta-CRDT with Snapshots - State Machine . . . . . . . . . 19 3.3 Virtual Test Environment - Mininet Network Emulator . . . . . . . . 20 3.4 Network Communication between Replicas . . . . . . . . . . . . . . . 23 3.5 Data Persistence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 ix Contents 4 Experimental Study 27 4.1 Setup of Virtual Test Environment . . . . . . . . . . . . . . . . . . . 29 4.1.1 Local Updates . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 4.1.2 Number of Updates Per Vehicle . . . . . . . . . . . . . . . . . 30 4.1.3 Network Performance . . . . . . . . . . . 
. . . . . . . . . . . . 31 4.1.4 Connectivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 4.1.5 Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 4.2 Converge Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 4.2.1 Reference System Converge Time . . . . . . . . . . . . . . . . 34 4.2.2 Centralized Converge Time . . . . . . . . . . . . . . . . . . . 35 4.2.3 State-CRDT System Converge Time . . . . . . . . . . . . . . 36 4.2.4 Delta-CRDT System Converge Time . . . . . . . . . . . . . . 37 4.3 Message Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 4.4 Message Latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 4.5 Bytes Sent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 4.6 Merge Latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 4.7 Quality Arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 5 Discussion 55 5.1 Discussing the Questions Asked for this Project . . . . . . . . . . . . 55 5.1.1 Can CRDTs be used in a real-life environment with large data sizes and low number of updates on a low bandwidth network? 55 5.1.2 Which type of CRDTs are suitable for this environment? . . . 56 5.1.3 How does the performance of the CRDT systems compare to the existing system? . . . . . . . . . . . . . . . . . . . . . . . 56 5.1.4 What are the trade-offs between the different types of systems? 56 5.2 This Project’s Delta-CRDT with Snapshots Design . . . . . . . . . . 57 5.3 CRDT Implications . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 5.4 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 5.4.1 Possible CRDT Optimizations . . . . . . . . . . . . . . . . . . 59 5.4.2 Designing Databases to Suit CRDTs . . . . . . . . . . . . . . 60 6 Conclusion 63 Bibliography 67 x List of Figures 2.1 An illustration of a centralized replication. . . . . . . . . . . . . . . . 7 2.2 Illustration of decentralized replication. . . . . . . . . . . . . . . . . . 7 2.3 Illustration of the CAP theorem, the center is the desired state but unreachable as stated by Brewer. . . . . . . . . . . . . . . . . . . . . 8 2.4 Example of an eventual consistent system providing old data to client due to slow propagation of updates. . . . . . . . . . . . . . . . . . . . 10 2.5 Example of integer state with Max merge function converging . . . . 12 3.1 Illustration of the system design for the thread that performs all actions. 16 3.2 Illustration of the system design for the thread that handles connec- tions from other nodes. . . . . . . . . . . . . . . . . . . . . . . . . . . 16 3.3 System software layout of State/Delta-CRDT systems . . . . . . . . . 18 3.4 Flowchart of State-based CRDT . . . . . . . . . . . . . . . . . . . . . 19 3.5 Flowchart of Delta-CRDT . . . . . . . . . . . . . . . . . . . . . . . . 20 3.6 Simple and Linear topologies. . . . . . . . . . . . . . . . . . . . . . . 21 3.7 Torus and Tree topologies. . . . . . . . . . . . . . . . . . . . . . . . . 22 3.8 System design of the Centrally managed system used as of today. . . 22 4.1 The Chepstow quarry, where data has been collected, with the po- sition of active vehicles marked with vehicle symbols. Picture from CPAC Systems [1]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 4.2 An illustration of how the local updates are created and perform on the systems in a consistent manner. . . . . . . . . . . . . . . . . . . . 
30 4.3 The basic idea behind the script performing systematic connects/disconnect on the nodes during the test. . . . . . . . . . . . . . . . . . . . . . . . 33 4.4 Size of messages sent by 16 nodes, 5s broadcast interval, 1.35% dis- connect rate. The boxes include 25-75 percentiles, the whiskers are max and min with outliers included and the dotted lines represents mean and solid lines mean. . . . . . . . . . . . . . . . . . . . . . . . . 39 4.5 Size of messages sent by 16 nodes, 5s broadcast interval (except State- CRDT which has 15 second broadcast rate because of crash during testing), 5.4% disconnect rate. The boxes include 25-75 percentiles, the whiskers are max and min with outliers included and the dotted lines represents mean and solid lines mean. . . . . . . . . . . . . . . . 40 xi List of Figures 4.6 Latency to receive messages between nodes for systems with 16 nodes and 5s broadcast interval and 1.35% disconnect rate. The boxes in- clude 25-75 percentiles, the whiskers are max and min with outliers included and the dotted lines represents mean and solid lines mean. . 42 4.7 Latency to receive messages between nodes for systems with 16 nodes and 5s broadcast interval and 5.4% disconnect rate (except State- CRDT which has 15 second broadcast rate because of crash during testing). The boxes include 25-75 percentiles, the whiskers are max and min with outliers included and the dotted lines represents mean and solid lines mean. . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 4.8 Latency to perform a merge operation in systems with 16 nodes and 5s broadcast interval and 1.35% disconnect rate. The boxes include 25-75percentiles, the whiskers are max and min with outliers included and the dotted lines represents mean and solid lines mean. . . . . . . 48 4.9 Latency to perform a merge operation in systems with 16 nodes and 5s broadcast interval and 5.4% disconnect rate (except State-CRDT which has 15 second broadcast rate because of crash during testing). The boxes include 25-75 percentiles, the whiskers are max and min with outliers included and the dotted lines represents mean and solid lines mean. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 4.10 Quality arrays created by a node where each line is an array of mes- sages received or not received from other nodes in the system. 1 is received message while 0 is lost message. State-CRDT system, 16 nodes, 15s broadcast interval, 5.4% disconnect rate. . . . . . . . . . . 52 4.11 Quality arrays created by a node where each line is an array of mes- sages received or not received from other nodes in the system. 1 is received message while 0 is lost message. Delta-CRDT system, 16 nodes, 15s broadcast interval, 5.4% disconnect rate. . . . . . . . . . . 53 5.1 Splitting of a large state to gain smaller states to decrease message sizes before sending across the network. . . . . . . . . . . . . . . . . . 59 xii List of Tables 2.1 The eight fallacies of distributed systems. . . . . . . . . . . . . . . . . 6 4.1 Results from gathered data about the actions performed. . . . . . . . 30 4.2 Time intervals displaying the number of updates performed by each node during testing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 4.3 Table of the signal quality from RSSI value. . . . . . . . . . . . . . . 32 4.4 A table of the hardware specifications used during testing . . . . . . . 34 4.5 Table of the different converge times from tests on Reference system. 
35 4.6 Converge times gathered during tests with various setting for the Centralized system. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 4.7 Table of the converge times gathered during tests for the State-CRDT system. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 4.8 Table of the converge times gathered during tests with various setting for the Delta-CRDT system. . . . . . . . . . . . . . . . . . . . . . . . 38 4.9 Bytes sent by 16 nodes with 5 seconds broadcast interval and 1.35% disconnect rate. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 4.10 Bytes sent by 16 nodes with 5 seconds broadcast interval and 5.4% disconnect rate (except State-CRDT which has 15 second broadcast rate because of the 5 second test crashing during testing). . . . . . . . 46 4.11 Summary of the gathered results from the quality arrays. This is a representation of the quality array displaying ( the maximum per- centage of drop messages by a node, maximum number of dropped messages in sequence by a node). Tests are done with 16 nodes and 15 second broadcast rate for all systems. . . . . . . . . . . . . . . . . 54 xiii List of Tables xiv List of Source Codes 1 Custom created network topology that simulates the simple single switch topology seen in Figure 3.6b . . . . . . . . . . . . . . . . . . . 21 2 Backend communication on all nodes listening for incoming connections. 24 3 How the systems send data between nodes. . . . . . . . . . . . . . . . 25 4 Code for message reception on nodes. . . . . . . . . . . . . . . . . . . 25 5 Collected connection data. . . . . . . . . . . . . . . . . . . . . . . . . 31 xv List of Source Codes xvi 1 Introduction Distributed storage systems use replication to deal with problems such as fault tol- erance, availability of data, lowering latencies, scalability and load balancing of large number of parallel connections. However, a prominent problem with distributed sys- tems is how to keep replicas consistent as ad-hoc solutions have proven to be error prone [30]. Conflict-Free Replicated Data Types (CRDTs) provide a theoretically sound approach for replicating data under the eventual consistency model. This research aim to examine how different types of CRDTs perform in an environment where connection and bandwidth is limited and how the different trade offs between the CRDT types may correlate to the results. 1.1 Background Keeping replicas in a distributed storage system both consistent and available to accept new updates is a problem that the CAP-theorem [5] states is impossible when partitioning is present. As partitioning is not a design choice in a system and is not possible to be predicted or avoided, networks can disconnect or devices fail for any reason, it must be taken into account. However, it is possible to mitigate the CAP-theorem through the use of consistency models. The general concept is that stronger consistency lowers availability and higher availability lowers consistency [20]. Depending on the application different consistency models are suitable to find the proper balance between consistency and availability. CRDTs use eventual consistency which is a consistency model that allows replicas to diverge, i.e., to accept new updates without prior synchronization with other repli- cas, thus providing high availability [15]. They use simple mathematical properties to solve any inconsistencies that might occur when the replicas perform updates independently of each other. 
CRDTs were first implemented in practice for example collaborative text editing and peer-to-peer computing and then later formalized by Shapiro et. al. [30], [31]. Both collaborative text editing and peer-to-peer com- puting value high availability in order to maximize the effective time they can do operations, i.e, minimize the time they need to wait for synchronization. 1 1. Introduction 1.2 Problem Statement There is a need to replicate data across multiple databases which are located on board vehicles over a network where partitioning is expected to be high due to the environment. When the vehicles are connected to the network the bandwidth and latency will be limited because of them having 3G connection limiting the amount of data the system can send to reach consistency. As the vehicles still are operational during the presence of partitioning there is a demand for high availability in the system even when there is partitioning on the network. The vehicles are currently using a centralized replication system which does not provide full availability but provides consistency in the form of linearizability. By removing the single point of failure in the centralized system, there is potential to gain better performance in the environment in which the vehicles are deployed in addition to higher availability. CRDTs were designed to provide consistency in systems where small updates are performed frequently, for example the number of times a song has been played. In the case of keeping track of how many times a song has been played the amount of data sent is small since the state consist of integers. When it comes to replication of databases where each entry has several parameters the amount of data is much larger and updates are subsequently often less frequent. In such a scenario CRDTs should not be preferable, especially if the system is using a network with low bandwidth. Throughout this thesis real-life system, real-life scenario and real-life environment refers to the scenario described in this problem statement. Real-life is used to differ- entiate from the virtual systems implemented in this thesis. The questions asked at the beginning of this project were: Can CRDTs be used in a real-life environment with large data sizes and a low number of updates on a low bandwidth network with connection losses? Which type of CRDTs are suitable for this environment? How does the performance of the CRDT systems compare to the existing system? What are the trade offs between the different types of systems? In order to answer the questions a number of metrics will be compared to measure the performance of each system, these evaluation metrics are explained in Section 1.5. All testing was conducted on a virtual network configured with data collected from the system in the scenario described in this problem statement. 1.3 Aim The aim of this research is to examine the viability of replicating data by applying CRDTs in the same environment as an already existing replication protocol. For this reason research about the strengths and weaknesses of different CRDTs is done. Based on this research two different CRDT solutions are created, one basic and one more advanced. In addition the performance is recorded in different metrics during 2 1. Introduction run-time to be able to compare the CRDT solutions with the existing solution. 
Drawing from the problem statement the following questions are researched: • Can CRDTs be used in a real-life environment with large data sizes and low number of updates on a low bandwidth network? • Which type of CRDTs are suitable for this environment? • How does the performance of the CRDT systems compare to the existing system? • What are the trade-offs between the different types of systems? The results from this thesis are to be used as guidelines when investigating the areas where the CRDTs can be used for replication of data in addition to where they can outperform more common replication protocols as the centralized system with strong consistency. 1.4 Scope There is no possibility to test the replication protocols on the vehicles directly as they are in production, instead all testing will be made in an emulated environment. To be able to reach the goals of the thesis, data is gathered about the behaviour of the vehicles such as updates performed as well as the quality of the network; bandwidth, latency and connectivity. This is done to be able to simulate an environment that is as close to the real-life scenario as possible to gain an accurate performance validation of the CRDTs. The performance of the different systems are compared in the following metrics; total bytes sent, message latency, message size, update latency, converge time and quality arrays which displays the messages lost where both the total number of lost messages in addition to the number of consecutive messages lost. 1.5 Performance Evaluation Metrics Different metrics are used to evaluate the communication times the first is message latency, the time in milliseconds it takes from that a connection is established until the entire message has been received. The next metric is the message size which is the size in bytes of each message a node send to another node. Finally, the bytes sent is the total amount of bytes each node in the system have sent during a test. To be able to draw conclusions regarding the processing times two different metrics are used. Firstly, the merge latency metric is used, the time in milliseconds it takes from a message, consisting of a state, is received until the receiving node has merged 3 1. Introduction the state with its own state. Secondly, the message size is also used to evaluate the processing times as there is a relation between the size of a message (state) and the message latency. Finally, the accuracy of a system is measured by looking at the converge time, the time in seconds it takes for the system to reach convergence, from the start of the test. The test start at time 0 but systems can not converge before 300 seconds because the system is still being updated for the first 300 seconds. In addition to converge time a metric called quality arrays was created to record which messages were dropped by which node and from that information derive how long time nodes are out of sync with the rest of the nodes. The quality arrays can be used to compare how the systems are able to handle different disconnect rate on the network. 1.6 Limitations Simulations are used instead of real world testing. The cost is simply too high to do real world testing and would most likely produce new challenges that need to be overcome for the specific real world setting thus making it unreasonable for the time frame of this project. The simulated problems the system encounter are loss of connection (partitioning) and low bandwidth (hardware limitation). 
Detecting byzantine or faulty nodes are not dealt with. All nodes will be considered to work correctly with no other faults than those mentioned above. The same database structure, that is already present in the centralized solution operating on the vehicles, will be used for all models. The database on the master node is a PostgreSQL database and vehicles’ local databases are SQLite databases. Throughout this thesis only SQLite databases are used as using different databases might provide skewed results if their performance is not identical. This also enhances the development time as time is limited. Time will not be spent to try to create optimized queries, efficient sorting etc. The limitations for this project are: • Real world testing – Will not have access to any vehicles – Simulations will be used instead • Byzantine behaviour nodes – No nodes are malicious – All nodes send correct messages, e.g., not broken – Nodes have robust memory in case of a crash • Database performance – Efficient sorting, queries etc. will not be considered 4 2 State of the Art State of the Art covers theory of fundamentals that all designers of distributed sys- tems must consider in order to create a functional system. Section 2.1 presents the definition of a distributed system as well as fallacies regarding distributed system design. The two fundamental types of distributed systems, centralized and decen- tralized, are explained in Section 2.2 and Section 2.3 respectively. Section 2.4 explain the CAP-theorem which is an aspect all distributed system designs must consider in order to perform as intended. Linked to this is the different consistency models used in distributed systems, explained in Section 2.5. Section 2.6 provide the theory behind the Conflict-free Replicated Data Types (CRDTs). They provide strong eventual consistency and are therefore attractive to distributed system designs where availability is important despite the presence of partitioning on the network. CRDTs are the main focus of this thesis and are implemented and tested in two out of three systems. 2.1 Replication in Distributed Systems The introduction of network communication between different machines in the 60’s and the continued technological advances during the next decades set the foundation for the machine-to-machine communication as we know it today [28]. Machine- to-machine communication provided the ability for machines to coordinate their work towards a common task by communicating. This gave birth to a new kind of system, distributed replication systems, or distributed systems for short, are defined as multiple devices connected via a network cooperating to perform some task [26]. Sabu M. Thampi explains one of the many reason why distributed systems have grown with the entrance of the information era, Thampi defines in his paper [25]: "When a handful of powerful computers are linked together and com- municate with each other, the overall computing power available can be amazingly vast. Such a system can have a higher performance share than a single supercomputer." The structure of such a system can be constructed in two different ways: central- ized, where one node is the master over a number of slave nodes in the network or decentralized where all nodes have the same priority. In a centralized system all communication from the slave nodes are directed towards the master node while in the decentralized system communication is made between arbitrary nodes. The pros 5 2. 
State of the Art and cons of the two approaches will be discussed in Section 2.2 and Section 2.3. In distributed systems there exist eight fallacies that distributed system designers often forget to take under consideration and that leads to design flaws that eventually will appear when the system is in production. The fallacies were first mentioned back in 1994 by Peter Deutsch, but they are still relevant for designers today. Arnon Rotem-Gal-Oz summarizes the eight fallacies in his paper [22] and the eight fallacies can be sen in Table 2.1. Table 2.1: The eight fallacies of distributed systems. 1. The network is reliable 2. Latency is zero 3. Bandwidth is infinite 4. The network is secure 5. Topology does not change 6. There is one administrator 7. Transport cost is zero 8. The network is homogeneous Depending on what type of system the designer is looking to create the importance of each fallacy may differ. A closed system on a local network may always have the same topology whilst an application on a network with nodes continuously connecting and disconnecting does not, the topology is changing with each connect/disconnect (breaking fallacy 5). Another thing to point out is that some fallacies are losing their importance for designers. As performance and reliability of hardware increases, failures on a network become less likely [22]. Throughout this project the designs will not take into consideration point 4, 6 and 8 in Table 2.1 because of the limitations mentioned in Section 1.6. 2.2 Replication with Centralized Management A system with a set of replicas P1, P2, ..., Pn where n− 1 replicas are symmetric but one node is different in terms of possible states and actions is called a centralized system. Figure 2.1 displays a possible network of five replicas where four out of the five replicas are symmetric and the middle replica differs from the other replicas. This is the primary replica and the others are secondary replicas. Secondary replicas that receive updates do not apply the updates immediately. In- stead they notify the primary replica that updates have been received and wait for the main replica to process consequent updates. An example would be when two secondary replicas receive an update at the same time, notify the primary replica at the same time and applies the updates in the same order due to the decision made by the primary replica. 6 2. State of the Art Figure 2.1: An illustration of a centralized replication. By performing all updates from the secondary nodes on the primary node, the level of consistency is linearizable and sequentially consistent. It is a simple approach for solving the problem of concurrent updates on multiple nodes. There are pros and cons for choosing to build a centralized system, where simplicity is a strong argument for this kind of system [18]. A weakness in a centralized system is the single point of failure, all secondary nodes rely on the primary node for updates. If the primary node never responds with updates to the secondary nodes they can never update and thus lose their availability and consistency. The system is also faced with the possible bottleneck of the primary node not having the performance to handle all requests from the secondary nodes making the whole system not function properly. If this is the case then it does not matter if the secondary nodes might have enough performance to handle their tasks if the primary node is the point of congestion. 
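To make the role of the primary concrete, the following minimal sketch shows the ordering scheme described above: secondaries never apply updates themselves but hand them to the primary, which imposes a single global order. The sketch and its class names (Primary, Secondary, submit, request_update) are our own illustration, not code from any of the systems in this thesis.

# Minimal sketch of centralized replication: secondaries forward updates to
# the primary, which decides the single authoritative order and returns the
# resulting state. Names are illustrative assumptions.
class Primary:
    def __init__(self):
        self.log = []                 # the one authoritative update order

    def submit(self, update):
        self.log.append(update)       # ordering is decided here, and only here
        return list(self.log)         # state handed back / broadcast to secondaries

class Secondary:
    def __init__(self, primary):
        self.primary = primary
        self.state = []

    def request_update(self, update):
        # Do not apply locally; forward the update and adopt whatever
        # order the primary decided on.
        self.state = self.primary.submit(update)

primary = Primary()
s1, s2 = Secondary(primary), Secondary(primary)
s1.request_update("a")
s2.request_update("b")
print(primary.log)    # ['a', 'b']: the order every secondary eventually adopts

The single log also makes the weaknesses visible: if the Primary object is unreachable or overloaded, no secondary can make progress.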
2.3 Replication with Distributed Management A system that has decentralized replication consists of a set U containing N nodes. The nodes can be identical but does not have to be. Each node k ∈ U has a connection to m ⊆ U nodes as is illustrated in Figure 2.2. The size of subset m for each node is not a fixed size and may change over time. Figure 2.2: Illustration of decentralized replication. 7 2. State of the Art All nodes in the system perform actions to reach a consistent state by communicating with other nodes. Actions are performed locally and the changes of the state are communicated to nodes currently in the subset m. The updates made propagate across the network and reaches all nodes N ∈ U either directly or indirectly if no node k ∈ U crash or disconnect indefinitely. In this case one could argue that U = U − k. In comparison to centralized replication decentralized replication has no single point of failure, all nodes can continue working if any node k ∈ U disconnects. The advan- tage of no single point of failure leads to higher complexity for keeping consistency when replicating data as all nodes must reach consensus. The complexity of the replication depends on the level of consistency required in the system. Stronger consistency requires more complex solutions than weak consistency. 2.4 Consistency, Availability, Partitioning; The CAP Theorem A problem designers of distributed systems are faced with is the CAP Theorem [5, 6]. CAP stands for Consistency, Availability and Partitioning. When designing a distributed system you want the system to have strong consistency, always be available and still be operational even when the network is partitioned due to any kind of failure. In the year of 2000, Brewer [5] made a conjecture stating that it is impossible to provide both consistency and availability when the network is experiencing partitioning. Figure 2.3: Illustration of the CAP theorem, the center is the desired state but unreachable as stated by Brewer. When a distributed system has geographically long distanced replicas of data and information is sent between the replicas, network partitioning will be present [23]. What the CAP theorem states is that there are two main options when handling a system that is in a partitioned state. One of the options is to block all opera- tions to replicas thus keeping consistency but providing no availability. The other 8 2. State of the Art is to allow all operations to replicas even though it breaks consistency [8]. However, choosing between Consistency and Availability is not exclusive, trade offs between consistency and availability can be made [13], for example reading but not writing may be allowed or certain data might be changed thus keeping some level of relation between consistency and availability. An illustration of Brewers conjecture is illus- trated in Figure 2.3 where it is impossible to reach the area where all three circles overlap. Strong consistency is desirable but costly for the system’s performance and it also block operations thus lowering the availability of the system [20]. A bank transfer requires stronger consistency, because of the nature of the transaction, than for example Amazon Dynamo [11] which has sacrificed strong consistency for better availability. 2.5 Consistency Models There are different consistency models, linearizability (strong), sequential, casual, eventual and weak consistency. The order in which they are mentioned corresponds to the strength of consistency [2]. 
As mentioned in Section 2.4 which consistency model a system needs depends on the area the system is to be used in. In this thesis eventual consistency is the main focus which is a weak consistency model that can be defined as: if no more updates are performed, all nodes will eventually converge to the same state within a finite time. Systems with weaker consistency models provide the ability to perform updates without the need for consensus and/or synchronization between nodes. This makes weaker consistency models better at operating in low bandwidth networks prone to disconnects because of the message complexity needed for synchronization between nodes [27]. This makes weaker consistency models popular with the growing need for available and scalable services like the Amazon Dynamo [11] and Facebook’s Cassandra [21] which both implement eventual consistency. Applications that use eventual consistency models are able to tolerate an inconsistent state for periods of time. Updates between different replicas occur rarely and instead updates are performed on the local replica to maintain availability for the client. Clients may read old data due to the slow replication of updates as shown in Figure 2.4 where a client deletes (1) the value "cat" from database B, "cat" is then removed (2), the client then performs a read (3) operation on B and finds it is empty. The client reads (4) once again but this time from another replica A, this replica returns the value "cat" as A and B have not yet synchronized (5) their data. The Amazon shopping cart is a famous real-life example of how this issue can become a problem [15]. Where items would reappear in clients shopping carts that had been removed earlier due to receiving "old" data from another replica that had not received the remove update performed by the client on the original replica where the remove was performed. 9 2. State of the Art Figure 2.4: Example of an eventual consistent system providing old data to client due to slow propagation of updates. 2.6 Conflict-Free Replicated Data Types Conflict-Free Replicated Data Types (CRDTs) are distributed data types that pro- vide a theoretically sound approach to enable distributed objects to be eventually consistent and non ad-hoc [31]. Ad-hoc approaches have previously proven to be error prone with the anomalies of the Amazon Shopping Cart [11] as a well known example where removed items in the shopping carts may reappear. Eventual consis- tency is ensured by CRDTs through the use of simple mathematical properties. As explained in Section 2.5 eventual consistency allow for a process to execute an opera- tion without synchronizing with other replicas beforehand thus providing availability of a replica to clients even in the presence of partitioning. This thesis will use the same terminology of atoms and objects as defined by Shapiro et. al. in [31]. They define that an atom is an immutable data type, for example integers and strings, that can be copied by processes. An object is a mutable, repli- cated data type that has an identity, a content which can be any number of atoms or objects, an initial state and an interface consisting of operations. Operations are called by clients to either acquire data from or modify an object. When a client update an object the operation is first called from the interface of the object and then the update is disseminated asynchronously. Two objects with the same identity located at different nodes are replicas of one another. 
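Using this terminology, a replicated object can be pictured as the small skeleton below. It is only a schematic illustration of the interface just described (an identity shared by replicas, a payload of atoms, query and update operations, asynchronous dissemination); the class and attribute names are our own and not taken from the literature.

# Schematic skeleton of a replicated object in the terminology of Shapiro et al.
# Names are illustrative only.
class ReplicatedObject:
    def __init__(self, identity):
        self.identity = identity      # two objects with the same identity are replicas
        self.payload = set()          # content: any number of atoms

    def query(self):
        # Read-only operation called by clients; does not modify the object.
        return set(self.payload)

    def update(self, atom):
        # Modifies the local replica first ...
        self.payload.add(atom)
        # ... and the update is then disseminated asynchronously to the
        # other replicas (dissemination is deliberately left out here).

obj = ReplicatedObject("vehicle-positions")
obj.update("position 1")
print(obj.query())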
There are two types of CRDTs: State-Based CRDTs (referred to as State-CRDTs in this thesis) and Operation-Based CRDTs [24] with the former being focused on in this thesis. The two types are explained in Section 2.6.1 and 2.6.2. CRDTs is a concept that was first implemented in practice in a cooperative text editor software [14] and then later formalized by Shapiro et. al. [31]. Another practical use of CRDTs is found in latency tolerant computing. Processes in a distributed computing system that use CRDTs can maximize the time they spend on actual computation and lower the amount spent on waiting for synchronization since CRDTs do not use consensus. 10 2. State of the Art The strength of CRDTs, providing full availability by not using consensus, is also a strong limitation of where CRDTs can be used. A trivial case where CRDTs cannot be used is when consensus is imperative for example with transfers of money in a bank. Something as simple as division creates trouble when designing a CRDT as division neither is commutative, associative nor idempotent, properties that a CRDT must fulfill. CRDTs do, despite full availability, provide strong eventual consistency. Strong eventual consistency does not guarantee that atoms in an object constitute coherent information from a human point of view. In the case of collaborative text editing software the CRDT only eventually guarantee that all writers see the same text. It does not guarantee that the text is grammatically correct. 2.6.1 State-CRDT An operation on a State-Based CRDT object occurs directly on the source and is then disseminated by sending the entire modified state to other replicas. When another replica receives a state U , it merges its own state A with the received state, meaning that the set difference (U \ A) is added to A. The merge operation follows the rules of a join-semilattice and is therefore commutative, associative and idempotent which are essential properties since State-CRDTs propagate their states asynchronously. State-CRDTs are easy to reason about and due to the states that are sent require few guarantees from the network, if a state is not able to be transferred properly it will be sent again without interrupting any other functionality. This is why State-CRDTs can function under extreme partitioning. The simplicity of State-CRDTs comes with the drawback of unbounded growth. Large states have a negative impact on the performance of the network especially when scaling the system. A workaround to this is presented in Section 3.2.4. A simple example of a State-CRDT, illustrated in Figure 2.5, is the CRDT that solve inconsistencies by using Max integer operation. The base state all replicas start with is an integer with value zero. Each replica may increment the integer an arbitrary number of times. The difference in amount of increments leads to diverging values on the replicas. Every replica sends its state which in this case is a single integer to its neighbours. When a replica receives an integer it persists the larger of the received integer and its local integer. If no further updates are made the replicas will all converge to the same state, i.e. having the same integer value, after a finite number of steps. 11 2. State of the Art Figure 2.5: Example of integer state with Max merge function converging 2.6.2 Operation-Based CRDT Operation-Based CRDTs (Op-CRDTs), unlike State-CRDTs, requires updates to be delivered on a reliable channel; updates are delivered in the correct order. 
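This delivery requirement can be illustrated with a small, hedged sketch of our own (not code from the thesis): the state-based max register from Figure 2.5 tolerates duplicated deliveries because its merge is idempotent, whereas an operation-based counter does not.

# Why State-CRDTs tolerate re-delivery while Op-CRDTs need a reliable channel.

# State-based max register (as in Figure 2.5): merge is idempotent, so
# receiving the same state twice changes nothing.
def merge_max(local, received):
    return max(local, received)

a = 3
a = merge_max(a, 5)        # a == 5
a = merge_max(a, 5)        # duplicate delivery: still 5

# Operation-based counter: applying the same "increment by 2" operation twice
# double-counts, so each operation must be delivered exactly once.
def apply_increment(counter, amount):
    return counter + amount

c = 3
c = apply_increment(c, 2)  # c == 5
c = apply_increment(c, 2)  # duplicate delivery: c == 7, replicas diverge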
Op- CRDTs are also able to reason about history, giving them more expressive power while at the same time making them more complex to implement. The Op-CRDT has three fundamental parts (notation taken from [30]): payload, query and update which is split into two parts atSource and downstream. The payload is an operation that is to be executed asynchronously at other replicas while a query is an operation that does not alter the state of the local source replica. The update is a function with the first part atSource which is executed if its precondition is true after evaluating the arguments it takes. It executes entirely at the local replica atomically and may return calculated results to the caller and/or prepare arguments for the second part, downstream. Downstream is first triggered if its precondition is true either by arguments from atSource or by the arguments contained in a payload. When the downstream has executed all the operations the state of the replica has been altered and as long as the network delivers all messages in causal order all replicas will eventually converge into the same state. Since most operations are not associative, commutative and idempotent by them- selves reasoning about history must be implemented together with the Op-CRDTs. This is done either by relying on the network delivering messages containing oper- ations at most once and in a causal order or by the replica maintaining a history about which messages have been delivered which would be part of the downstream’s precondition. Being able to reason about history gives Op-CRDTs more expressive power while at the same time making them more complex to implement. 12 2. State of the Art 2.6.3 State-CRDTs with Delta-mutators Delta-CRDTs, proposed by Almeida, Shoker and Baquerro [4], is a solution to the State-CRDTs’ problem of ineffective dissemination as the states get bigger. Delta- CRDTs create delta-states that have smaller message sizes by using delta-mutators. Every local update is first run through the delta-mutator which is a function that creates the delta-state, only containing new information, which is then applied to the local replica before being broadcasted to all other nodes over the network. Delta-states are not idempotent in comparison to the states of State-CRDTs meaning that reasoning about history similar to Op-CRDTs must be taken into consideration. When nodes do updates often they create entropy by sending many small delta-states that are also propagated directly between nodes which may congest the network. A way to limit this entropy is to accumulate messages up to a threshold which once reached trigger the sending of the delta-states. Almeida et al. [4] propose two algorithms for anti-entropy of the delta-states, one basic that sends the whole state periodically and another that uses unique IDs for each delta-state that are garbage collected once all nodes have acknowledged a delta-state. 13 2. State of the Art 14 3 Design and Implementation In this chapter the designs and implementations of the three systems are presented. This chapter starts by explaining the designs of the different systems. The first Section 3.1 being the Reference system which is a Centrally managed system, as explained in 2.2. Following are the other system designs, Section 3.2, that are decentralized management systems and use CRDTs to reach consistency. After the CRDT design Section 3.2 continues explaining the design behind the merge operation that is a vital part in both of the CRDT systems in Section 3.2.2. 
The different design variations of the CRDT systems are then further ex- plained in Section 3.2.3 together with 3.2.4. In Section 3.3, Virtual Test Environment, the network emulator with its functionality and usage is presented. This is the environment which the systems will be deployed to and that in turn has implementation of backend network communication. Section 3.4, Backend Communication, explain different choices of network layer protocols and why sockets running TCP was deemed the best option. Real-life scenario and vehicles are in this section referring to those mentioned in the Problem Statement in Section 1.2. The real-life scenario is emulated by the virtual network and the vehicles are represented by replicas in the virtual network. The programming language used was Python. SQLite was used to create and handle databases. Python was chosen because of its excellent support for both Mininet and SQLite. 3.1 Centrally Managed Replication Without CRDTs The implementation of the Centrally managed system was based on the information received about the system in use in real life when the thesis started. As mentioned in section 3.3 the Centrally managed system have N −1 slave nodes and one master node. The master node in the real-life system is using PostgreSQL though in this thesis the master node runs SQLite the same way as the slave nodes. By using the same type of database there was the possibility to reuse the same solutions for all nodes making implementation faster and easier in addition to giving a fairer comparison for the CRDT systems as the speed of the database structures should 15 3. Design and Implementation not affect the results. A restriction to not focus on database performance was made before the thesis started and the usage of the same database type aligns with this restriction. Figure 3.1: Illustration of the system design for the thread that performs all actions. The simulated nodes can perform three different actions to alter the state of the database, insert, update and delete. All nodes can perform actions but in the Centrally managed system the master node perform actions immediately. The slave nodes on the other hand send the actions they receive to the master node which performs the actions i.e. updates the state, as illustrated in Figure 3.1, the state is then broadcasted to all slave nodes. When slave nodes receive the state they update their own state. Figure 3.2: Illustration of the system design for the thread that handles connections from other nodes. The master node repeatedly broadcasts its state to all other nodes. When the slave nodes continuously merge the received states from the master node they stay synchronized. If a node loses connection to the network it can still perform updates even though they will not be seen until the connection to the master node is restored. Figure 3.2 Illustrates the behaviour of the primary and a secondary node when a message is received from another node. 3.2 CRDT System Variations State-Based CRDT systems were determined to be better suited than Operation- Based CRDT systems in a scenario were connections are unreliable and thus the order in which messages are received cannot be guaranteed. Shapiro et. al. show 16 3. Design and Implementation that it is possible to emulate Operation-Based CRDTs with State-Based CRDTs and vice versa [30]. This is exactly what would happen if an Operation-Based CRDT was used since the same information the State-Based CRDT sends would have to be sent by the Operation-Based CRDT. 
For this reason a primitive State-Based CRDT as well as a more advanced and complex version, the Delta-CRDT, have been used, with corresponding explanations in Section 3.2.3 and Section 3.2.4, respectively. A common aspect of both types is that access to and storage of data is made via SQLite databases, which support the operations described in Section 3.5. When a node detects a previously unknown node it creates a copy of the structure (without the data) of its own database and assigns the newly discovered node's ID to the new copy. This means that each node holds a copy of every other node's database, where a database depicts a node's state. When a node merges an atom, a container for the entire state or a delta-state of another node, it only accesses the database that the atom is to be entered in.

We make the following assumptions, which are fundamental for the systems to function:

Assumption 1. Every atom anm can only originate from node Nn.

Assumption 2. Entries in a state are discarded if they do not follow sequential numbers, e.g., update ui+1 can only be applied if update ui exists in the database.

Assumption 3. Broken messages are discarded, and nodes have robust memory that survives crashes without being corrupted.

3.2.1 Software Design Layout

The software layout of the State/Delta-CRDT systems can be seen in Figure 3.3. The State/Delta-Topology (Mininet), first box, sets up a virtual network with the desired number of nodes and the supplied network configuration. On each node a State/Delta-Backend (server), second box, is started. Each Backend listens for incoming connections on a certain port, can set up connections to other Backends for sending messages, and holds a State/Delta-CRDT object, third box, which provides the utility needed to converge with other nodes. The State/Delta-CRDT object can create new databases, one for each newly discovered node, and perform all other database actions such as querying for the current state, merging a state and deleting. All database actions are made through the DbConnect object, fourth box.

Figure 3.3: System software layout of the State/Delta-CRDT systems

3.2.2 Merge Operation

Both the State-CRDT's and the Delta-CRDT's state S are implemented as two grow-only sets, Y and X, together forming what Shapiro et al. [31] call a 2P-set. An atom may exist in Y, in X, or in both. The grow-only set Y contains all atoms, and the grow-only set X, also called the graveyard, contains deleted atoms. The set difference Y \ X therefore yields all atoms that have not been deleted. Both sets are represented as tables in the same database.

The CRDT systems use a merge operation to reach convergence across all nodes. The actions preceding the merge operation differ between the State-CRDT and the Delta-CRDT, as explained in Section 3.2.3 and 3.2.4, but both use the same merge operation.

The merge operation follows the same principle as the union in set theory. Let A = {N1, ..., Nn} be the set of nodes, where n is the total number of nodes in the network. Each node Nk, k ∈ {1, ..., n}, has a state Sk, and each Sk consists of a set of sub-states Gm, m ∈ {1, ..., n}. The sub-state Gk is the local sub-state of node Nk, while the other sub-states Gm (m ≠ k) are representations of the local sub-states of the other nodes Nm.
The system has reached convergence when all states are equal, i.e., Sk = Sm for all nodes Nk and Nm with m ≠ k.

A node Nk+1 that receives a state Sk from another node applies the merge operation. The merge performed by Nk+1 takes, for every m, the union of the received and the local sub-state: Gm in Sk+1 is replaced by (Gm in Sk) ∪ (Gm in Sk+1). In other words, the sub-states are updated with all entries that differ between the received sub-states in Sk and the current sub-states in Sk+1. When Nk+1 has applied the merge operation it holds at least the newest information in Gk, unless new local updates have occurred at Nk during message transit or while the merge was being processed. If no more updates occur in the network, no nodes disconnect and the network delivers all messages, then all nodes will eventually converge to the same state.

3.2.3 State-based CRDT - State Machine

The State-Based CRDT that has been used is primitive in every sense, which is visualized by the flowchart in Figure 3.4. Nodes make local updates and broadcast periodically, sending all data they possess to all other nodes whenever their broadcast timer reaches zero. When broadcasting, no consideration is taken as to whether there have been new local updates, whether the state has already been received by another node, or whether other nodes need to be updated at all.

Figure 3.4: Flowchart of the State-based CRDT

The State-Based CRDT uses the merge operation explained in Section 3.2.2 to reach convergence on all nodes. The merge operation becomes more inefficient the larger the states that are sent.

3.2.4 Delta-CRDT with Snapshots - State Machine

The Delta-CRDT designed and implemented in this thesis builds upon the State-based CRDT explained in Section 3.2.3. The merge operation is exactly the same, while the differences that make it more efficient are found in the state snapshots and the calculation of delta-states.

A state snapshot describes the current state of a replica. Each update performed locally by a node is stored in its own database and receives an auto-incremented integer key. The state snapshot consists of the highest incremented key of each table in each database the node has. The values are mapped to table names, which in turn are mapped to their database, in a nested dictionary, i.e., a nested hash table. The state snapshot, which is considerably smaller than the full state, is broadcasted to all nodes.

A node A that has received a state snapshot from node B creates a state snapshot of its own local databases and calculates a delta snapshot, δsnapshoti = Asnapshoti − Bsnapshoti, for every entry i ∈ {1, 2, ..., n} in the snapshot dictionary. The δsnapshot is then sent back to B, which uses it to create the delta-state, containing only the entries that A does not have, and sends it directly to A. A then applies the merge operation, explained in Section 3.2.2, on the received delta-state.

Figure 3.5: Flowchart of the Delta-CRDT

The entropy that Delta-CRDTs create, and the subsequent need for garbage collection explained in Section 2.6.3, is circumvented by the design of the Delta-CRDT in this project, seen in Figure 3.5. This is due to the broadcasting of state snapshots and the direct communication of delta-states between the two nodes concerned; delta-states are not propagated further between nodes and thus there is no need for garbage collection. Furthermore, this design keeps the idempotent property of the CRDT, which the Delta-CRDT proposed by Almeida, Shoker and Baquero [4] does not. Reasoning about history is therefore not necessary with this design.
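To make the snapshot and delta-state exchange concrete, the sketch below walks through the steps for two nodes. It is a simplified illustration under stated assumptions rather than the thesis implementation: each state is a plain Python dictionary mapping a (node, table) pair to the set of auto-incremented keys it holds, SQLite and the network layer are omitted, and the function names snapshot, delta_state and merge are our own.

def snapshot(state):
    # Highest auto-incremented key per (node, table); much smaller than the state.
    return {key: max(rows) if rows else 0 for key, rows in state.items()}

def delta_state(state, remote_snapshot):
    # Entries the remote node is missing, judged from its snapshot.
    delta = {}
    for key, rows in state.items():
        seen = remote_snapshot.get(key, 0)
        missing = {row for row in rows if row > seen}
        if missing:
            delta[key] = missing
    return delta

def merge(local_state, received):
    # Union merge from Section 3.2.2: commutative, associative and idempotent.
    for key, rows in received.items():
        local_state.setdefault(key, set()).update(rows)

# B broadcasts its snapshot, A answers with its delta snapshot, and B then
# builds the delta-state A is missing (computable from A's snapshot), which
# A finally merges into its own state.
a = {("A", "positions"): {1, 2, 3}}
b = {("A", "positions"): {1}, ("B", "positions"): {1, 2}}
delta_for_a = delta_state(b, snapshot(a))   # {("B", "positions"): {1, 2}}
merge(a, delta_for_a)                       # A now holds B's rows as well

Because the merge is a plain union, applying the same delta-state twice leaves the state unchanged, which is why this design keeps the idempotent property mentioned above.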
3.3 Virtual Test Environment - Mininet Network Emulator

Mininet is an emulator that can emulate complex network topologies [19]. The emulated network contains hosts, links, switches and controllers, which are the building blocks of the network structure. Furthermore, Mininet has different connection settings within the virtual network, which in this project are used to create limitations in bandwidth and connectivity and to simulate the network partitioning that occurs in the real-life scenario. The simulated partitioning allows the systems to be tested with different levels of connectivity.

Mininet is designed for testing and development and provides functions for concurrent development. In the setting of this project there is the possibility of hosts (vehicles) being added to the network dynamically. Mininet is scalable in the number of hosts that can be added to the topology [19].

Mininet has different levels of functionality. A command can be written at the command line to build predefined network topologies within the Mininet software. The predefined topologies are Tree, Simple, Reversed, Torus and Linear. Mininet also allows the user to build custom topologies by writing Python scripts that declare the structure of the desired topology.

(a) Linear topology with four hosts and switches. (b) Simple topology with nine hosts and a single switch.
Figure 3.6: Simple and Linear topologies.

from mininet.topo import Topo
from mininet.net import Mininet
from mininet.cli import CLI

class CustomSingleTopo(Topo):
    # Declare the hosts, a single switch and the links between them.
    def build(self, no_hosts=4):
        hosts = [self.addHost("host%s" % (h + 1))
                 for h in range(no_hosts)]
        switch = self.addSwitch("Switch1")
        for host in hosts:
            self.addLink(host, switch)

# Makes the topology available to the mn command-line tool.
topos = {'customsingletopo': (lambda: CustomSingleTopo())}

singleswitch = CustomSingleTopo(4)
network = Mininet(topo=singleswitch)
network.start()
CLI(network)
network.stop()

Source Code 1: Custom created network topology that simulates the simple single-switch topology seen in Figure 3.6b.

When designing a custom topology there are four building blocks that can be used: controllers, switches, hosts and links. The functionality of the building blocks can be customized by the designer in different ways. For links it is possible to set the bandwidth, the percentage of packets lost, the maximum packet queue size and the delay in milliseconds, i.e., the time it takes for a packet to go from A to B [19]. On hosts, the CPU power as well as the number of cores available to the host can be declared. For switches, the transfer protocol and the type of switch can be chosen. In addition to all these options, how the building blocks are connected as well as the number of each component are all decided within the script.

Since the systems cannot be tested on the vehicles in the real-life scenario, this thesis is based on a custom-made topology similar to the predefined simple topology shown in Figure 3.6b. In the real-life scenario the vehicles are not able to talk directly to each other, so they use a cloud service to relay messages for them, as illustrated in Figure 3.8. In the virtual network the cloud is emulated by a switch.

(a) Tree topology with a depth of two and a fan-out of four. (b) Torus topology with nine nodes and hosts.
Figure 3.7: Torus and Tree topologies.
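The link and host parameters described above are what allow low-bandwidth, lossy connections to be emulated. As a hedged sketch of how Source Code 1 could be extended with such limits (the class name LimitedSingleTopo and all numeric values are illustrative placeholders, not the measured figures used in Chapter 4), Mininet's traffic-controlled links and CPU-limited hosts can be configured as follows:

from mininet.topo import Topo
from mininet.net import Mininet
from mininet.node import CPULimitedHost
from mininet.link import TCLink

class LimitedSingleTopo(Topo):
    # Same single-switch layout as Source Code 1, but every host link is
    # shaped: bandwidth, propagation delay, loss rate and a bounded packet
    # queue. All values below are placeholders.
    def build(self, no_hosts=4):
        switch = self.addSwitch("Switch1")
        for h in range(no_hosts):
            host = self.addHost("host%s" % (h + 1), cpu=0.25)
            self.addLink(host, switch,
                         bw=1, delay='100ms', loss=2, max_queue_size=100)

network = Mininet(topo=LimitedSingleTopo(4),
                  host=CPULimitedHost,   # enforces the cpu= share above
                  link=TCLink)           # enforces bw/delay/loss/queue size
network.start()
network.pingAll()                        # quick sanity check of connectivity
network.stop()

Here bw is given in Mbit/s, loss is a percentage per link, and the cpu parameter is the fraction of the machine's CPU that each host may use.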
In the Centrally managed system not all hosts are equal: in the real-life system there is a central host, named "Cloud" in Figure 3.8. A custom simple topology was therefore made with N − 1 slave hosts and one master host, giving a total of N hosts in the network. The master node represents the central database in this scenario. All hosts have limited processing power and a connection to the switch with limited bandwidth that is disconnected at predefined times.

Figure 3.8: System design of the Centrally managed system used as of today.

To achieve this, a Python script was written that is made up of two classes, CustomTopo and CustomTopology. CustomTopo extends the Topo class from the Mininet package. CustomTopo defines a function called build, which must exist in a Topo subclass as it is run when an object of the class is created. In build, the number of hosts, switches and links in the network topology is defined. Hardware properties of the hosts, such as the number of cores, the percentage of CPU the host is allowed to use and the IP address, are also defined in the script. The number of cores was set to two and the IP address of each host was set to "20.1.90.ID", where ID is the ID number of the host in the range 1, 2, 3, ..., n.

The CustomTopology class sets up the network and the performance of the network. First an object of CustomTopo is created with the desired number of hosts. Then the performance of the links is defined, followed by the switches. Next, all the settings for the parts are combined, together with additional settings such as the IP base, into a network object using the Mininet class. This object represents the network and it can be started and shut down with built-in functions.

Once the network is up and running, individual hosts (nodes) can be selected in the network and programs can be run on them. This is used to run the script containing one of the different backend solutions compared in this thesis. Once the hosts are running their backends it is possible to start a simple command-line interface for the network with the CLI() function within Mininet. The command-line interface makes it possible to break link connections between the hosts and the switch by typing certain commands in the CLI. As we want to recreate exactly the same behaviour in all tests, we cannot type every command individually. Instead, once the hosts are running their backends, we start a script that creates the connection behaviour we want and that provides reproducible results for all tests.

3.4 Network Communication between Replicas

In all the systems tested in this thesis, communication is done using sockets running TCP. Several libraries that could have provided a sufficient backend solution were considered: SocketServer, SimpleHTTPServer and Sockets, or a combination of SimpleHTTPServer and SocketServer. SimpleHTTPServer uses SocketServer to send its messages, and SocketServer in turn uses Sockets. Based on the low bandwidth and the high rate of connection loss, SimpleHTTPServer was ruled out, as the extra data added to each message by the HTTP protocol would increase the message size for functionality that was deemed unnecessary for the systems [7, 10]. This left SocketServer and Sockets as options, with Sockets being the better option for the core backend communication as it provides the bare minimum the systems need. Each node acts as both server and client in the network.
In all systems, nodes listen for connections from other nodes and connect to other nodes to send updates or states.

The server part of a node was built by setting up a socket configured for IPv4 with the Transmission Control Protocol (TCP). An IP address together with a port is bound to the socket. With these parameters set, the socket can listen for incoming connections. A loop checks for incoming connections, and when a connection is established a thread is created to handle it, so that new incoming connections are not blocked. The code, used in all systems created in this project, can be seen in Source Code 2.

import socket
from threading import Thread

# AF_INET -> IPv4 and SOCK_STREAM -> TCP
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Starting the server
sock.bind((IP, PORT))

# Listening for connections
sock.listen(5)
while True:
    # Someone connected to the socket
    connection, connectioninfo = sock.accept()

    # Create a new thread to handle each connection
    # (handleconnection is the application-specific handler defined elsewhere)
    thread = Thread(target=handleconnection, args=(connection,))
    thread.daemon = True
    thread.start()

Source Code 2: Backend communication on all nodes listening for incoming connections.

To be able to send data in an environment where connections may be lost at any time, the recipient must be able to determine whether the entire message has been received. Since the data is a stream and chunks of data can arrive at different speeds, the receiver does not know how much data to wait for. To counter this problem, an "end of transmission byte" in the form of a semicolon is appended to the message, which can be seen on line 1 in Source Code 3. Once the data is ready to be sent, the size of the data is saved in order to determine when all bytes of the message have been sent. The data is sent by the send function, which returns the number of bytes sent. It transmits as many bytes as possible, but there is no guarantee that all bytes are sent. Therefore, after each call to send, the number of sent bytes is added to the total bytes sent, and sending continues with the bytes that have not yet been sent. If send returns 0 the connection has been lost and the send attempt is canceled. This is shown on lines 6-7 in Source Code 3.

1 data = (DATATOSEND + ";").encode()
2 datasize = len(data)
3 totalsent = 0
4 while totalsent < datasize:
5     sent = sock.send(data[totalsent:])
6     if sent == 0:
7         break
8     totalsent += sent

Source Code 3: How the systems send data between nodes.

As mentioned before, the data stream that is sent is terminated by a semicolon. The receiver continuously compares every incoming byte individually to the end byte. If the received byte is neither the end byte nor empty, the byte is appended to the received message. If the end byte is received, the loop is broken and the message is delivered. The order of the received bytes is maintained thanks to the sockets using a TCP connection. If an empty byte is received the connection is broken and the entire message is dropped. The code for the recipients is shown in Source Code 4.

data = ""
receivedMessage = False
while True:
    byte = connection.recv(1)
    byte = byte.decode()
    if byte == ";":
        # End-of-transmission byte: the whole message has arrived
        receivedMessage = True
        break
    elif byte:
        data += byte
    else:
        # Empty byte: the connection was closed, drop the message
        break

Source Code 4: Code for message reception on nodes.

3.5 Data Persistence

The databases that nodes use to persist data in the virtual network are SQLite databases.
SQLite has support for Python, has good documentation and is the type of database the vehicles in the real-life scenario use. The nodes have exactly the same database layout as the vehicles, with the exception of foreign keys (relations between tables). All tables in the database use a primary key column called _ID, which is an auto-incremented integer. This means that when an insert is performed on a table, the new row gets an _ID value that is one higher than the _ID of the last inserted row.

In this thesis the functions the database can perform have been limited; these can be seen in the API "DbConnect" in Figure 3.3. This API allows both the State/Delta-CRDT systems and the Centrally managed system to create new databases from a hard-coded layout, add entries to tables, query the database for the full state or desired parts of it, check whether entries or databases exist, get snapshots (only used by the Delta-CRDT), delete entries and update entries. DbConnect contains no decision making; all decisions are made by the systems themselves. It only allows the systems to communicate with the database or databases.

4 Experimental Study

In this chapter the results are shown and discussed, both the gathered data representing the connectivity of the network and the behaviour of the vehicles in the real-life system, seen in Section 4.1, and the results from the tests made in the simulation. The results are used to compare the performance of the different implemented systems: the Delta-CRDT, the State-CRDT and a Centrally managed system, discussed in Sections 3.2.4, 3.2.3 and 3.1 respectively.

A Reference system has been created to compare against and to draw conclusions about the performance of the other systems. The Reference system is a centrally managed system operating in a test environment with perfect conditions. Perfect conditions in the Mininet network emulator (the virtual environment), explained in Section 3.3, means that the bandwidth of the links is set to 1 Gbit/s, the latency is 10 ms and there is no connection loss in the system. The Reference system is run in the virtual environment with the same numbers of nodes as the other systems: 8, 13 and 16. The tests of the Reference system serve as a baseline when comparing the performance of the systems by the measured metrics defined in the following paragraphs.

The Centralized system and the CRDT systems have run the same test script with the same predefined settings for number of nodes, broadcast time interval and disconnect rate. The Reference system differs by having better bandwidth, better latency and a disconnect rate of 0%. Section 4.1 clarifies how the test script was created and how the settings for the virtual nodes and the virtual environment were determined based on data collected from a system in production at CPAC Systems. This is followed by a section with the specifications of the hardware that the virtual environment, systems and tests were run on. The subsequent sections present the results of each system for the performance metrics that were measured.

In Section 4.2 the results for the converge time metric are presented. Converge time is the time it takes, in seconds from the test start, for all nodes in the system to converge to the same state. The converge time gives a hint of the overall performance of the system; a high converge time suggests that the system has had difficulty handling the load on the network or that nodes have not been able to keep up with the local calculations.
Because of how it is implemented, the converge time will always satisfy ConvergeTime ≥ TestDuration.

Section 4.3 presents the results for the message size metric, i.e., the number of bytes of each message sent by a node. The message sizes can be compared to see how much strain the system as a whole puts on the network; the bigger the messages sent, the longer it takes for the receiver to get a message and the more network resources are used. The message size is measured by the node before sending each message. The message size is directly correlated to the size of the state on the nodes for several of the examined systems, as the message sometimes is the entire state.

Following Section 4.3 is the message latency metric, presented in Section 4.4. The message latency is the time from when a connection is set up between two nodes until the entire message has been received at the receiving node. The message latency metric shows the performance of the network. High message latencies indicate that the network has been flooded with too many and/or too large messages. There is a direct correlation between the message size and the message latency if the bandwidth is constant.

Section 4.5 presents how many bytes in total were sent during a test. The purpose of this metric is to give a more comprehensive view of how much data is transferred in the system. The bytes sent provides another dimension to the number of messages sent in the system as a whole, and shows how the total amount of bytes is spread between the nodes in the system. Each time a message is sent, the size of that message is added to the total bytes sent metric, so at the end of the test it contains the total size of all messages sent.

Section 4.6 presents the merge latency. The merge latency is the time it takes for a node to perform an operation. An operation in this context is either a received message from another node, which corresponds to a state, or a local update, which can be seen as a state with a single atom. When a state is received by a node, the node performs a merge operation, discussed in Section 3.2.2, to update its own state. The merge latency is the time from when a state is received until the merge operation has been completed. This metric provides information on how the nodes in the systems handle the calculations required to perform their actions, and whether being flooded with operations affects the latency negatively and, if so, to what extent.

Finally, Section 4.7 discusses the results of the quality arrays. A quality array is defined as an array recording the success or failure of message deliveries from other nodes. Quality arrays are not a performance metric but rather a way to verify that nodes actually have diverged during the test. In that section it is discussed how the different systems diverge and how the nodes reacquire the same state after diverging.

The results presented and discussed in this chapter are a selection from the extensive amount of data generated by the tests. The results are summarized, and the focus has been on the performance of the systems for tests made with 16 nodes, as this is the most demanding setting for the systems. All the gathered results from the tests can be found at: https://github.com/perzonas/exjobb/tree/master/results.
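To illustrate how these per-node measurements can be recorded, the following is a minimal Python sketch of an in-memory logger per node. The class and field names are illustrative only and do not correspond to the thesis implementation.

import time

class MetricsLogger:
    # Illustrative per-node collector for the metrics described above.

    def __init__(self, node_id):
        self.node_id = node_id
        self.message_sizes = []        # bytes of each message sent (Section 4.3)
        self.message_latencies = []    # seconds from connection setup to full reception (Section 4.4)
        self.total_bytes_sent = 0      # running sum of message sizes (Section 4.5)
        self.merge_latencies = []      # seconds per merge operation (Section 4.6)
        self.quality = {}              # quality arrays: sender -> list of True/False deliveries (Section 4.7)

    def record_send(self, payload):
        size = len(payload)
        self.message_sizes.append(size)
        self.total_bytes_sent += size

    def record_delivery(self, sender_id, connected_at, success):
        self.quality.setdefault(sender_id, []).append(success)
        if success:
            self.message_latencies.append(time.time() - connected_at)

    def record_merge(self, started_at):
        self.merge_latencies.append(time.time() - started_at)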
4.1 Setup of Virtual Test Environment

The test environment consists of a virtual network created with Mininet, explained in Section 3.3, in which the nodes run a script that simulates local updates; these updates are then propagated differently by the nodes in each system.

Figure 4.1: The Chepstow quarry, where data has been collected, with the position of active vehicles marked with vehicle symbols. Picture from CPAC Systems [1].

The test data and parameters are based on data produced by vehicles running in the Chepstow mine, seen in Figure 4.1. In this unreliable environment messages are dropped because vehicles disconnect from time to time. The data is used to get as accurate inputs as possible for the systems and to improve the overall accuracy of the simulation.

4.1.1 Local Updates

For consistent testing, and to test the systems in a way similar to the real-life scenario, updates at each node are performed so as to mimic how the vehicles would receive them. Every node continuously reads rows from its local file, where each row represents an action performed by the vehicle that is to be applied to the local database. In order for the updates not to be applied all at once, a script writes to N local files, N being the number of nodes in the test, at predefined times. This way the timing of the updates can be controlled and all tests are identical for all the systems. An illustration of the process is shown in Figure 4.2.

Figure 4.2: An illustration of how the local updates are created and applied to the systems in a consistent manner.

4.1.2 Number of Updates Per Vehicle

How much load, in terms of updates, to put on the systems was based on data gathered from 13 active vehicles. The data provided the number of actions from three different time intervals, Short Interval (48 h), Medium Interval (1 week) and Long Interval (3 weeks), each with a different window size (30 min, 1 hour and 3 hours respectively). The service provided to extract the gathered data did not support going further into depth, i.e., looking at smaller windows; it did not allow us to see how the updates were distributed within the windows available to us. We focused on the three time intervals; the key figure for each is the last row of that interval's results in Table 4.1.

Table 4.1: Results from gathered data about the actions performed.

Short Interval (48 h)
    Average updates per hour: 421.888889
    Average updates per hour per vehicle: 32.452991
    Highest number of updates in a 30 min window: 737
Medium Interval (1 week)
    Average updates per hour: 631.938967
    Average updates per hour per vehicle: 48.610690
    Highest number of updates in a 1 hour window: 3295
Long Interval (3 weeks)
    Average updates per hour: 614.625571
    Average updates per hour per vehicle: 47.278890
    Highest number of updates in a 3 hour window: 9088

The interval with the highest number of updates in Table 4.1 is used as a guideline when creating the script that performs the local updates, discussed in Section 4.1.1. If a system is able to handle the higher number of updates it will be able to handle lower loads as well. Updates are performed in waves to stay as close to real life as possible: when a vehicle starts working at the beginning of its shift it does not perform all of its work directly; the work is spread across the entire day.
Each wave of updates is different, to see whether the systems can handle a high number of actions in a short period of time as well as lower loads.

Table 4.2: Time intervals displaying the number of updates performed by each node during testing.

Time (s):  0-50  50-95  95-135  135-170  170-200  200-225  225-245  245-260  260-270  275-280  >280
Updates:   5     10     3       3        1        10       3        5        20       15       5

The behaviour of the vehicles in real life can also be viewed as waves, as they perform their updates at different speeds; a digger loading a dumper truck performs more frequent updates than a dumper truck transporting material from point A to point B. The different waves can be seen in Table 4.2. At the end of the test (after 245 s) more than 50% of the updates are performed, to simulate a burst of updates with almost the same number of updates as the average updates per hour in Table 4.1. The behaviour of the local updates should affect the results and thus provide interesting end results.

4.1.3 Network Performance

To be able to simulate a test environment as close as possible to the network of the real-life vehicles, data was collected from the vehicles in production. The data, containing network performance measurements collected from the vehicles, is structured as shown in Source Code 5.

{
    "timestamp": "1555579143274",
    "signalStrength": 18,
    "networkOperator": "Tele2",
    "dataStatus": 2,
    "location": {
        "longitude": 11.9994135,
        "latitude": 57.647903
    }
},
{
    "timestamp": "1555579143274",
    "signalStrength": 99,
    "networkOperator": "unknown",
    "dataStatus": -1,
    "location": {
        "longitude": 11.9994135,
        "latitude": 57.647903
    }
}

Source Code 5: Collected connection data.

The fields of importance are timestamp, an epoch timestamp, i.e., the number of milliseconds since January 1st 1970, and signalStrength, the quality of the connection in Arbitrary Strength Unit (ASU). There are different ASU scales depending on what kind of signal is measured; in this case it is GSM (3G) [3]. With GSM (3G), ASU can take values between 0 and 31, as well as 99, where 99 means that the signal is undetectable (disconnected). networkOperator is ignored as it provides no useful information in this context. dataStatus shows whether the vehicle is connected, 2, or disconnected, -1. location is also ignored as only connectivity is of interest, but it could be used to find connectivity issues in certain areas.

The signal quality is measured in dBm, and for the 3G network it is called Received Signal Strength Indication (RSSI) [9]. To convert an ASU value into an RSSI value the following formula is used:

RSSI = 2 × ASU − 113 [17]

The calculated RSSI value can then be used to reason about the quality of the signal. The signal quality is divided into five levels [12], shown in Table 4.3.

Table 4.3: Table of the signal quality from RSSI value.

RSSI                    Quality
> -70 dBm               Excellent
-70 dBm to -85 dBm      Good
-86 dBm to -100 dBm     Fair
-101 dBm to -110 dBm    Poor
<= -110 dBm             No signal

As the signal quality declines, the ability to transfer data is lost faster than the availability of telephone communications. Already at poor connection quality in Table 4.3 there is next to no data throughput available [29].
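As a concrete illustration of the conversion and classification above, the following is a small Python sketch; the function names simply restate the formula and the levels of Table 4.3 and are not part of the thesis implementation.

def asu_to_rssi(asu):
    # Convert a GSM ASU reading (0-31, or 99 for undetectable) to RSSI in dBm.
    if asu == 99:
        return None  # undetectable signal / disconnected
    return 2 * asu - 113

def classify_rssi(rssi):
    # Map an RSSI value in dBm to the quality levels of Table 4.3.
    if rssi is None or rssi <= -110:
        return "No signal"
    if rssi > -70:
        return "Excellent"
    if rssi >= -85:
        return "Good"
    if rssi >= -100:
        return "Fair"
    return "Poor"

# Example: the average signal strength of about 16.13 ASU observed in the collected data
print(classify_rssi(asu_to_rssi(16.13)))  # 2 * 16.13 - 113 = -80.74 dBm -> "Good"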
The collected data show the following results:

• Total number of seconds measured: 6426
• Total number of seconds disconnected: 87
• Total number of seconds connected: 6339
• Downtime percentage: 1.354%
• Number of disconnects: 3
• Average duration of a disconnect: 29 seconds
• Average time between disconnects: 32 min 8 s
• Highest signal strength recorded (ASU): 26
• Lowest signal strength recorded (ASU): 1
• Average signal strength (ASU): 16.1325
• Average signal strength in dBm (RSSI): -80.7349
• Average signal strength is classified as: Good (good voice and data)

From these results it can be deduced how often and for how long connections should be removed between hosts, explained further in Section 4.1.4. In addition, it can be seen that the average quality of the connection is sufficient for providing both data and voice to the nodes.

There is no direct formula that gives the upload and download speeds of a 3G connection based on the signal strength, in either ASU or RSSI. To find a bandwidth for the network, the connection speeds of 3G connections were analyzed. There are many possibilities depending on the generation of 3G used: upload speeds range from 0.1 Mbit/s up to 8 Mbit/s and download speeds from 0.3 Mbit/s to 42 Mbit/s [12, 16]. In this case the upload speed is the more important of the two, since a node cannot download more than another node can upload; the upload speed thus sets the limit for the connections. During testing the speed is kept at 1.5 Mbit/s, which is a typical throughput of a 3G network [12]. We believe this is the right choice for the scope of this thesis. The average latency of all the 3G networks in Australia is 150 ms [16], while Ken Lo states that the average latency for 3G overall is 100 ms [12]. Based on this, the tests are performed using a latency of 150 ms to simulate bad connectivity; the systems will only work better with better latencies.

4.1.4 Connectivity

To be able to simulate connectivity loss, a script was made that runs in parallel with all the hosts and with the local updates script mentioned in Section 4.1.1. This script changes the status of the links between the hosts and the switch, where the switch emulates the network that all nodes connect to. When the status of a link is changed to off, no packets can be sent across that specific link. The status can be switched freely between on and off at any time. The time a host is disconnected is brief, as discussed in Section 4.1.3.

Figure 4.3: The basic idea behind the script performing systematic connects/disconnects on the nodes during the test.

The idea is that a test is 300 seconds long, and with 1.35% disconnect time this results in 4 seconds of disconnect over the entire test. The 4 seconds do not have to be consecutive and can be split into multiple shorter sequences. In the collected data there were 3 different disconnects, all of similar length. To recreate the real-life scenario, 3 disconnects were simulated, each lasting approximately N/3 seconds, where N is the total time in seconds a node should be disconnected. For the test based on the real-life data, N equals the 4 seconds computed above. During the tests two different disconnect patterns are used: some nodes disconnect 3 times and others disconnect 2 times but for longer periods. It is good to have variety in the duration of the disconnects as well as longer periods of downtime.
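The following is a minimal sketch of such a disconnect driver, assuming access to the running Mininet network object; Mininet's configLinkStatus call is used to take a host-switch link down and bring it up again. The host names, switch name and timing values are illustrative, not the exact test configuration.

import time

def run_disconnects(net, schedule, switch="Switch1"):
    # Take links down and up according to a schedule of
    # (host_name, start_time_s, duration_s) tuples, relative to test start.
    # A non-overlapping schedule is assumed; overlapping disconnects would need one thread per host.
    start = time.time()
    for host, at, duration in sorted(schedule, key=lambda e: e[1]):
        # Wait until it is time for this disconnect
        delay = at - (time.time() - start)
        if delay > 0:
            time.sleep(delay)
        net.configLinkStatus(host, switch, "down")  # host loses connectivity
        time.sleep(duration)
        net.configLinkStatus(host, switch, "up")    # host reconnects

# Example: host1 disconnects three times for about 1.3 s each (roughly 4 s in a 300 s test),
# host2 disconnects twice but for longer periods.
schedule = [
    ("host1", 60, 1.3), ("host1", 150, 1.3), ("host1", 240, 1.4),
    ("host2", 90, 2.0), ("host2", 210, 2.0),
]
# run_disconnects(network, schedule)  # 'network' is the started Mininet object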
In the real-life data the disconnects are not of equal length, so choosing longer disconnect durations for some nodes provides a more interesting result.

The tests are based on the information gathered from the data. A disconnect duration of 4 seconds, based on the 1.35% downtime, is used as the base case. Additional tests with increased disconnect duration are performed to simulate scenarios with lower connectivity than that deduced from the gathered data. This is to see whether there are any changes in the performance of the different systems when they are challenged with lower connectivity.

4.1.5 Hardware

All the tests have been performed on a Lenovo Thinkpad T420. To get a fair comparison of the results, a single machine was used for all the tests; using multiple machines could have skewed the results of this thesis. The specifications are shown in Table 4.4.

Table 4.4: A table of the hardware specifications used during testing.

LENOVO THINKPAD T420
    Processor: Intel® Core™ i5-2520M CPU @ 2.50 GHz
    Memory: 8 GB DDR3
    Storage: Samsung MZ7PA 128 GB SSD
    Operating system: Ubuntu 18.04.2 LTS

4.2 Converge Time

In this section the converge time of the systems is presented and discussed. Converge time is the time it takes for the system to converge to the same state on all nodes from the start of the test. The converge time cannot be less than 300 seconds because of how the test is built: local updates are performed for the first 300 seconds and the system cannot converge before all local updates have been introduced.

4.2.1 Reference System Converge Time

The converge times of the Reference system tests are practically as low as possible. The test cannot finish before 300 seconds, as this is the length of the test. Table 4.5 shows the converge times of all the tests performed with perfect conditions. The results clearly show that the system converges within 0.3 seconds after the 300-second test, independent of the number of nodes and the broadcast interval. Table 4.5 also displays two patterns: the converge time increases, if only slightly, with the number of nodes in the system as well as with a decreasing broadcast interval.

Table 4.5: Table of the different converge times (s) from tests on the Reference system.

Number of Nodes    Broadcast interval (s):  5           15          25          35
8                                           300.0339    300.0163    300.0157    300.0165
13                                          300.15034   300.0346    300.0357    300.0337
16                                          300.27895   300.0843    300.0481    300.0882

4.2.2 Centralized Converge Time

The converge times for the Centralized system are shown in Table 4.6. All results are similar, with converge times between 300 and 301 seconds, with the exception of the results for 16 nodes. The load on the system increases with the size of the data and the number of nodes. For each additional node, the total data size in the system increases, as does the number of nodes the master node needs to send its state to when broadcasting.

Table 4.6: Converge times (s) gathered during tests with various settings for the Centralized system.
Disconnect rate             Nodes    Broadcast interval (s):  5           15          25          35
4 seconds/test, 1.35%       8                                 300.0313    300.0287    300.0328    300.0303
8 seconds/test, 2.7%        8                                 300.0161    300.0173    300.0168    300.0179
16 seconds/test, 5.4%       8                                 300.0166    300.0209    300.0155    300.0166
4 seconds/test, 1.35%       13                                300.0679    300.0647    300.0688    300.0651
8 seconds/test, 2.7%        13                                300.0361    300.0376    300.0371    300.0355
16 seconds/test, 5.4%       13                                300.0400    300.0354    300.0350    300.0386
4 seconds/test, 1.35%       16                                335.2897    300.0917    300.0935    300.0958
8 seconds/test, 2.7%        16                                358.6397    300.0492    300.0511    300.0519
16 seconds/test, 5.4%       16                                427.7964    300.0621    300.0466    309.4215

The results show that somewhere between 13 and 16 nodes, with a 5 second broadcast interval, the system starts having problems staying within the one-second window that all the other tests converged in. The network gets flooded with messages, increasing the message latency, as discussed in Section 4.4. This affects the converge time negatively, and the tests also show that it is affected by the disconnect rate (DR) of the systems: a higher DR gives a higher converge time when the network is being flooded. When the network is not flooded the results are similar regardless of the DR, as can be seen in all results with 13 nodes or fewer. There is one outlier in the results, the test with 16 nodes, 5.4% DR and a 35 second broadcast interval. Since the other tests with a 35 second broadcast interval converge in less than 301 seconds, the cause could be badly timed broadcasts, or one node losing the last message due to a disconnect and having to wait for the next broadcast.

4.2.3 State-CRDT System Converge Time

The converge time of the State-CRDT system grows quickly with an increasing number of nodes as well as with a decreasing broadcast interval. For 8 nodes the converge time is steady within 300 < ConvergeTime < 301, but with 13 nodes the results already show worse converge times. Looking at the tests with a broadcast interval of 5 seconds, the results are significantly worse, with converge times ranging from about 1400 to almost 2000 seconds, as seen in Table 4.7. The network is not able to handle the amount of data sent between the nodes; all messages are transferred at slower speeds as more messages are sent concurrently. Once the message latency exceeds the broadcast interval, the network enters a vicious circle where each following broadcast only slows the network down further. In the tests made, slowing down the network also slows down all other actions performed by the nodes, as everything is simulated on the same machine.

Table 4.7: Table of the converge times (s) gathered during tests for the State-CRDT system.

Disconnect rate             Nodes    Broadcast interval (s):  5           15          25          35
4 seconds/test, 1.35%       8                                 300.1140    300.1078    300.1079    300.1072
8 seconds/test, 2.7%        8                                 300.1311    300.0535    300.0576    300.0596
16 seconds/test, 5.4%       8                                 300.0593    300.0616    300.0579    300.0573
4 seconds/test, 1.35%       13                                1953.8584   300.3037    300.3017    300.2878
8 seconds/test, 2.7%        13                                1407.7268   300.1439    300.1490    300.1684
16 seconds/test, 5.4%       13                                1400.2475   314.5570    300.1450    300.1416
4 seconds/test, 1.35%       16                                4798.6765   985.2645    472.7948    300.4186
8 seconds/test, 2.7%        16                                6001.0533   1171.3792   574.9403    300.2352
16 seconds/test, 5.4%       16                                >16000*     1344.4049   640.9116    343.6800

* The computer used during testing ran out of memory and crashed.

The broadcast interval has a bigger effect on the converge time than the DR, and there are vast improvements when the broadcast interval is increased.
For example, in the tests with 2.7% DR the converge time for 16 nodes starts at about 6000 seconds for the 5 second broadcast interval and decreases with each step of the broadcast interval, down to 300.2352 seconds for the 35 second broadcast interval. That is a decrease of 95% in converge time; the corresponding number is ≈ 96% for the test with 5.4% DR, calculated using the last value gathered before the computer crashed, 16000 seconds. The reason for the computer crashing is that during the test the system reaches a point where the message latency is longer than the broadcast interval. Each additional broadcast then slows the network down further, and for each message in transit a new thread is started. With more threads being added than threads finishing their tasks and being killed, the computer runs out of memory and crashes. Each time this happened the operating system needed to be repaired. The connections do not time out, because the messages are sent at very low speeds but the connections are never broken or lost. We could have implemented a maximum message latency before dropping the connection on the receiver side, but then the systems might have stopped working completely, as all messages could get dropped if the message latency were higher than the threshold. If we had used different thresholds for each system, the results would be skewed and invalid.

Comparing the converge times of the different systems, the Reference system shows no indication of reaching any limits with high-frequency broadcasts during the 16 node tests. The Centralized system starts getting worse converge times when tested with 16 nodes and a 5 second broadcast interval. In contrast, the State-CRDT already gets worse results with 13 nodes and a 15 second broadcast interval.

4.2.4 Delta-CRDT System Converge Time

The Delta-CRDT system manages to converge within one second of the test length for almost all tests with a broadcast interval of 15 seconds or more. In the two most demanding tests in terms of DR for this broadcast interval, with 16 nodes, the system is not able to converge within one second of 300, as shown in Table 4.8. This is an improvement compared to the State-CRDT system, which already had an increased converge time of 314 seconds with 13 nodes, a 15 second broadcast interval and 5.4% DR, in addition to a converge time of 343 seconds with 16 nodes and a 35 second broadcast interval. The State-CRDT system gets better results when the broadcast interval is increased, and the same can be seen for the Delta-CRDT system.

When excluding all converge times of less than 301 seconds, the Delta-CRDT system also follows the same pattern as the Centralized and State-CRDT systems, where a higher DR increases the converge time of the system if all other parameters remain unchanged. For most tests the converge times of the Delta-CRDT system and the Centralized system, and all tests of the Reference system, are within the interval 300 ≤ ConvergeTime ≤ 301 seconds, which is considered an acceptable and good converge time for the system used today. For the 5 second broadcast interval the Delta-CRDT system has trouble matching the converge times of the Centralized system: since more data is sent in the Delta-CRDT system, the network gets flooded and slowed down compared to the Centralized system.
Since the Delta-CRDT system with a 25 second broadcast interval has a good converge time for all numbers of nodes, it would be a plausible replacement for the Centralized system, using a higher broadcast interval, when only taking converge time into consideration. Even though this broadcast interval is higher than what the Centralized system can handle without losing converge time, the machines using Delta-CRDTs can still be updated equally often or even more often, as they can receive updates from an arbitrary machine and updates can propagate along any path.

Table 4.8: Table of the converge times (s) gathered during tests with various settings for the Delta-CRDT system.

Disconnect rate             Nodes    Broadcast interval (s):  5           15          25          35
4 seconds/test, 1.35%       8                                 300.1067    300.1109    300.1079    300.1084
8 seconds/test, 2.7%        8                                 300.0572    300.0564    300.0560    300.0573
16 seconds/test, 5.4%       8                                 308.3991    300.0551    300.0543    300.0540
4 seconds/test, 1.35%       13                                384.1841    300.3399    300.2772    300.2781
8 seconds/test, 2.7%        13                                420.1244    300.1711    300.1460    300.1455
16 seconds/test, 5.4%       13                                588.4631    300.1397    300.1426    300.1446
4 seconds/test, 1.35%       16                                1204.7629   300.5428    300.4104    300.4158
8 seconds/test, 2.7%        16                                1418.3686   333.5832    300.2087    300.2159
16 seconds/test, 5.4%       16                                1758.2876   367.1295    300.2191    300.2161

If a slave node N in the Centralized system misses a message from the master node M, due to connectivity issues, it has to wait for the next broadcast from M before updating its own state. The time N has to wait to receive the update is then equal to 2 × broadcastinterval, and there is the possibility that the next message does not reach node N either, making the time to receive the update, if the first message is dropped, ≥ 2 × broadcastinterval. For a node N in the Delta-CRDT system that time is instead > broadcastinterval but not necessarily 2 × broadcastinterval, as N does not have to receive the update from node M directly. The update can take alternative paths, M → K → D → N or M → K → N, where M, N, K, D ∈ P, P is the set of all nodes in the system and M ≠ N ≠ K ≠ D. Another node may receive the update and then broadcast the newly updated state to the node that missed the original broadcast; thus the message can be received at N after ≤ 2 × broadcastinterval.

4.3 Message Size

In this section the message size results measured during testing are shown and discussed. The message size is the size of each message sent by a node during a test. The message size metric's primary use is to show what improvement the Delta-CRDT provides. As described in the prelude to Chapter 4, the amount of data generated by the tests is extensive. The results of the message size tests are therefore presented as summarizing boxplots, seen in Figures 4.4 and 4.5. All data can be viewed at https://github.com/perzonas/exjobb/tree/master/results.

Both Figure 4.4 and Figure 4.5 present the tests with 16 nodes and a 5 second broadcast interval. The difference is the disconnect rates, which are 1.35% and 5.4% respectively. The reason for presenting the tests with the most nodes and the lowest broadcast interval is that these are the tests the systems struggled with the most. Looking at tests with a longer broadcast interval only increases the average and median message size. This is because the nodes accumulate more data between the broadcasts, as the data generation is not dependent on the broadcast interval.

Figure 4.4: Size of messages sent by 16 nodes, 5 s broadcast interval, 1.35% disconnect rate.
The boxes include the 25-75 percentiles, the whiskers are the max and min with outliers included, the dotted lines represent the mean and the solid lines the median.

The results shown in Figures 4.4 and 4.5 are quite similar. The maximum message size remains the same despite an increase in disconnect rate. This is expected, since the data that is predefined to be produced in the test is a fixed amount. The increase in disconnect rate does, however, affect the upper quartile of the sizes of the messages sent in the Centrally Managed system and the overall message size of the State-CRDT system. The reason is that the nodes that are disconnected are still producing data; when they reconnect they send the accumulated data, resulting in larger message sizes.

The Delta-CRDT system has the same mean (dotted line) regardless of which disconnect rate is used. The Delta-CRDT system's nodes do accumulate data while being disconnected, just like in the Centrally Managed system and the State-CRDT system. The reason the Delta-CRDT system's mean message size remains the same even though the disconnect rate is increased is that disconnected nodes allow the remaining nodes to send smaller delta-states: the nodes that are disconnected are not adding to the delta-states that the connected nodes send to each other. Taking this to an extreme, consider the case where only two of all the nodes are not disconnected. These two nodes will send smaller delta-states, as they only send whatever new data they themselves are producing. Thus, during the time with only two nodes connected, the mean message size becomes lower, in contrast to the Centrally Managed system and the State-CRDT system, which do not lower their means.

Figure 4.5: Size of messages sent by 16 nodes, 5 s broadcast interval (except the State-CRDT, which has a 15 second broadcast interval because of a crash during testing), 5.4% disconnect rate. The boxes include the 25-75 percentiles, the whiskers are the max and min with outliers included, the dotted lines represent the mean and the solid lines the median.

The mean value of the Delta-CRDT is skewed by the snapshots that the Delta-CRDT sends. These snapshots are consistently approximately 3000 bytes in size, regardless of the size of the delta-states that the nodes are sending, thus lowering the mean. The same applies to the result of the Centrally Managed system, where the slaves send very small messages that lower the mean, though not to the same extent as in the Delta-CRDT system. However, in the case of the Centrally Managed system the messages sent by the slaves are not overhead messages such as the snapshots the Delta-CRDT system is sending. The mean for the Delta-CRDTs in Figures 4.4 and 4.5 lies outside the box, in contrast to the other systems where the mean is located within the box. The box of the Delta-CRDT is very t