Balancing strict performance requirements and trade-offs for efficient data handling in unbounded flows - Design considerations for a proof-of-concept stream processing pipeline for vehicular data validation
dc.contributor.author | Josefsson, Måns | |
dc.contributor.author | Wall, Carl-Magnus | |
dc.contributor.department | Chalmers tekniska högskola / Institutionen för data och informationsteknik | sv |
dc.contributor.department | Chalmers University of Technology / Department of Computer Science and Engineering | en |
dc.contributor.examiner | Massimiliano Gulisano, Vincenzo | |
dc.contributor.supervisor | Papatriantafilou, Marina | |
dc.date.accessioned | 2025-02-25T13:41:14Z | |
dc.date.available | 2025-02-25T13:41:14Z | |
dc.date.issued | 2024 | |
dc.date.submitted | | |
dc.description.abstract | In the current era of ever-growing networks of sophisticated sensors and smart devices, vast volumes of data are created at every moment. Big Data environments such as these necessitate scalable and efficient data pipelines in order to provide near real-time analytics and monitoring. However, with especially high, unbounded data rates, traditional (store-then-process) database procedures and batch-based processing methods struggle to remain performant. To this end, processing streams of data continuously has become an increasingly appealing approach, targeting low latency, high scalability and real-time data processing. Stream processing tools enable the execution of both stateless and stateful computations on data as it is being transferred through the pipeline, providing opportunities to increase the efficiency of processing procedures such as data validation. However, stream processing pipelines can be complex, especially in multi-tenant settings, and it is not always clear how to approach their implementation efficiently. A process such as data validation might be subject to a multitude of requirements that all affect pipeline design. To investigate such requirements and give detailed insight into how to design a stream processing pipeline for efficient data validation on unbounded flows of data, a proof-of-concept pipeline is developed and tested in a case study at Volvo Trucks. The case study involves multi-tenant automotive testing, where the proposed pipeline enables near real-time validation of data, for purposes such as monitoring vehicle sensor behavior. The pipeline consists of Apache Kafka, for persistent event storage, Apache Flink, for continuous stateful analysis, and Apache Druid, for data serving. The pipeline is evaluated against a set of metrics, namely data completeness, sustainable throughput, latency, scalability and fault tolerance.
In order to harmonize the requirements of the pipeline and discern how trade-offs affect performance, various tool-tuning experiments and stress tests are performed. Performance evaluation of the pipeline reveals that in a controlled environment, with limited resources, the minimum throughput requirement of the use case can be sustained, while still achieving sub-second latencies and offering a degree of fault tolerance. The pipeline also shows promise of adapting well to different levels of scale, providing enough headroom for a tenfold increase in data volumes over current demands. | |
dc.identifier.coursecode | DATX05 | |
dc.identifier.uri | http://hdl.handle.net/20.500.12380/309161 | |
dc.language.iso | eng | |
dc.setspec.uppsok | Technology | |
dc.subject | Data pipelines | |
dc.subject | stream processing | |
dc.subject | data completeness | |
dc.subject | latency & throughput | |
dc.subject | fault tolerance | |
dc.subject | scalability | |
dc.subject | data validation | |
dc.title | Balancing strict performance requirements and trade-offs for efficient data handling in unbounded flows - Design considerations for a proof-of-concept stream processing pipeline for vehicular data validation | |
dc.type.degree | Examensarbete för masterexamen | sv |
dc.type.degree | Master's Thesis | en |
dc.type.uppsok | H | |
local.programme | Computer science – algorithms, languages and logic (MPALG), MSc | |
local.programme | Computer systems and networks (MPCSN), MSc | |