Balancing strict performance requirements and trade-offs for efficient data handling in unbounded flows - Design considerations for a proof-of-concept stream processing pipeline for vehicular data validation

Type
Master's Thesis
Program
Computer science – algorithms, languages and logic (MPALG), MSc
Computer systems and networks (MPCSN), MSc
Published
2024
Authors
Josefsson, Måns
Wall, Carl-Magnus
Abstract
In the current era of ever-growing networks of sophisticated sensors and smart devices, vast volumes of data are created at every moment. Big Data environments such as these necessitate scalable and efficient data pipelines in order to provide near real-time analytics and monitoring. At especially high, unbounded data rates, however, traditional store-then-process database procedures and batch-based processing methods struggle to remain performant. To this end, processing streams of data continuously has become an increasingly appealing approach, targeting low latency, high scalability, and real-time data processing. Stream processing tools enable the execution of both stateless and stateful computations on data as it is transferred through the pipeline, providing opportunities to increase the efficiency of processing procedures such as data validation. However, stream processing pipelines can be complex, especially in multi-tenant settings, and it is not always clear how to approach their implementation efficiently. A process such as data validation might be subject to a multitude of requirements that all affect pipeline design. To investigate such requirements and give detailed insight into how to approach the design of a stream processing pipeline for efficient data validation on unbounded flows of data, a proof-of-concept pipeline is developed and tested in a case study at Volvo Trucks. The case study involves multi-tenant automotive testing, where the proposed pipeline enables near real-time validation of data for purposes such as monitoring vehicle sensor behavior. The pipeline comprises Apache Kafka, for persistent event storage, Apache Flink, for continuous stateful analysis, and Apache Druid, for data serving. The pipeline is evaluated against a set of metrics, namely data completeness, sustainable throughput, latency, scalability, and fault tolerance. To harmonize the requirements of the pipeline and discern how trade-offs affect performance, various tool-tuning experiments and stress tests are performed. Performance evaluation of the pipeline reveals that in a controlled environment, with limited resources, the minimum throughput requirement of the use case can be sustained while still achieving sub-second latencies and offering a degree of fault tolerance. The pipeline also shows promise of adapting well to different levels of scale, providing enough headroom for a tenfold increase in data volumes over current demands.
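As a minimal sketch of how the Kafka-to-Flink portion of such a pipeline fits together, the Flink job below consumes sensor events from Kafka and runs a keyed, stateful validation check before handing results to a sink (in the thesis pipeline, validated output would be served via Apache Druid). The topic name, broker address, CSV record format, and monotonic-counter check are illustrative assumptions, not taken from the thesis itself.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class VehicleValidationJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Periodic checkpoints provide the degree of fault tolerance the abstract mentions.
        env.enableCheckpointing(10_000L);

        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("localhost:9092")   // assumed broker address
                .setTopics("vehicle-sensor-events")      // hypothetical topic name
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source")
           // Events are assumed to be CSV lines: "<vehicleId>,<counterValue>".
           .keyBy(line -> line.split(",")[0])            // partition state per vehicle
           .flatMap(new MonotonicCounterCheck())         // stateful validation
           .print();                                     // stand-in for the Druid-facing sink

        env.execute("vehicle-data-validation");
    }

    /** Stateful check: flags a reading whenever a per-vehicle counter decreases. */
    static class MonotonicCounterCheck extends RichFlatMapFunction<String, String> {
        private transient ValueState<Long> lastValue;

        @Override
        public void open(Configuration parameters) {
            lastValue = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("last-value", Long.class));
        }

        @Override
        public void flatMap(String line, Collector<String> out) throws Exception {
            long value = Long.parseLong(line.split(",")[1].trim());
            Long previous = lastValue.value();
            if (previous != null && value < previous) {
                out.collect("INVALID (counter decreased): " + line);
            } else {
                out.collect("VALID: " + line);
            }
            lastValue.update(value);
        }
    }
}
```

Because the validation state is keyed by vehicle and checkpointed, a job of this shape can scale out by adding parallel task slots and recover its per-vehicle state after failures, which is how a Flink-based design addresses the scalability and fault-tolerance metrics the thesis evaluates.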
Subject / keywords
Data pipelines, stream processing, data completeness, latency & throughput, fault tolerance, scalability, data validation