Balancing strict performance requirements and trade-offs for efficient data handling in unbounded flows - Design considerations for a proof-of-concept stream processing pipeline for vehicular data validation

Type
Master's Thesis
Program
Computer science – algorithms, languages and logic (MPALG), MSc
Computer systems and networks (MPCSN), MSc
Published
2024
Authors
Josefsson, Måns
Wall, Carl-Magnus
Abstract
In the current era of ever-growing networks of sophisticated sensors and smart devices, vast volumes of data are created at every moment. Big Data environments such as these necessitate scalable and efficient data pipelines in order to provide near real-time analytics and monitoring. At especially high, unbounded data rates, however, traditional store-then-process database procedures and batch-based processing methods struggle to remain performant. To this end, processing streams of data continuously has become an increasingly appealing approach, targeting low latency, high scalability, and real-time data processing. Stream processing tools enable the execution of both stateless and stateful computations on data as it is transferred through the pipeline, providing opportunities to increase the efficiency of processing procedures such as data validation. However, stream processing pipelines can be complex, especially in multi-tenant settings, and it is not always clear how to approach their implementation efficiently. A process such as data validation might be subject to a multitude of requirements that all affect pipeline design. To investigate such requirements and give detailed insight into how to approach the design of a stream processing pipeline for efficient data validation on unbounded flows of data, a proof-of-concept pipeline is developed and tested in a case study at Volvo Trucks. The case study involves multi-tenant automotive testing, where the proposed pipeline enables near real-time validation of data for purposes such as monitoring vehicle sensor behavior. The pipeline comprises Apache Kafka, for persistent event storage, Apache Flink, for continuous stateful analysis, and Apache Druid, for data serving. The pipeline is evaluated against a set of metrics, namely data completeness, sustainable throughput, latency, scalability, and fault tolerance. To harmonize the requirements of the pipeline and discern how trade-offs affect performance, various tool-tuning experiments and stress tests are performed. Performance evaluation of the pipeline reveals that in a controlled environment, with limited resources, the minimum throughput requirement of the use case can be sustained while still achieving sub-second latencies and offering a degree of fault tolerance. The pipeline also shows promise of adapting well to different levels of scale, providing enough headroom for a tenfold increase in data volumes over current demands.
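As a minimal sketch of how the Kafka-to-Flink portion of such a pipeline fits together, the Flink job below consumes sensor events from Kafka and runs a keyed, stateful validation check before handing results to a sink (in the thesis pipeline, validated output would be served via Apache Druid). The topic name, broker address, CSV record format, and monotonic-counter check are illustrative assumptions, not taken from the thesis itself.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class VehicleValidationJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Periodic checkpoints provide the degree of fault tolerance the abstract mentions.
        env.enableCheckpointing(10_000L);

        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("localhost:9092")   // assumed broker address
                .setTopics("vehicle-sensor-events")      // hypothetical topic name
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source")
           // Events are assumed to be CSV lines: "<vehicleId>,<counterValue>".
           .keyBy(line -> line.split(",")[0])            // partition state per vehicle
           .flatMap(new MonotonicCounterCheck())         // stateful validation
           .print();                                     // stand-in for the Druid-facing sink

        env.execute("vehicle-data-validation");
    }

    /** Stateful check: flags a reading whenever a per-vehicle counter decreases. */
    static class MonotonicCounterCheck extends RichFlatMapFunction<String, String> {
        private transient ValueState<Long> lastValue;

        @Override
        public void open(Configuration parameters) {
            lastValue = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("last-value", Long.class));
        }

        @Override
        public void flatMap(String line, Collector<String> out) throws Exception {
            long value = Long.parseLong(line.split(",")[1].trim());
            Long previous = lastValue.value();
            if (previous != null && value < previous) {
                out.collect("INVALID (counter decreased): " + line);
            } else {
                out.collect("VALID: " + line);
            }
            lastValue.update(value);
        }
    }
}
```

Because the validation state is keyed by vehicle and checkpointed, a job of this shape can scale out by adding parallel task slots and recover its per-vehicle state after failures, which is how a Flink-based design addresses the scalability and fault-tolerance metrics the thesis evaluates.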
Subject / keywords
Data pipelines, stream processing, data completeness, latency & throughput, fault tolerance, scalability, data validation