Balancing strict performance requirements and trade-offs for efficient data handling in unbounded flows - Design considerations for a proof-of-concept stream processing pipeline for vehicular data validation
Type
Master's Thesis
Program
Computer science – algorithms, languages and logic (MPALG), MSc
Computer systems and networks (MPCSN), MSc
Published
2024
Authors
Josefsson, Måns
Wall, Carl-Magnus
Abstract
In the current era of ever-growing networks of sophisticated sensors and smart devices,
there are vast volumes of data created at every moment. Big Data environments
such as these necessitate scalable and efficient data pipelines in order to
provide near real-time analytics and monitoring. However, at especially high, unbounded
data rates, traditional store-then-process database procedures and batch-based
processing methods struggle to remain performant. To this end, processing
streams of data continuously has become an increasingly appealing approach,
targeting low latency, high scalability and real-time data processing. Stream processing
tools enable the execution of both stateless and stateful computations on
data as it is being transferred through the pipeline, providing opportunities to increase
the efficiency of processing procedures such as data validation. However,
stream processing pipelines can be complex, especially in multi-tenant settings, and
it is not always clear how to efficiently approach their implementation. A process
such as data validation might be subject to a multitude of requirements that all
affect pipeline design and considerations. To investigate such requirements and give
detailed insight into how to approach the design of a stream processing pipeline for
efficient data validation on unbounded flows of data, a proof-of-concept pipeline
is developed and tested in a case study at Volvo Trucks. The case study involves
multi-tenant automotive testing, where the proposed pipeline enables near real-time
validation of data, for purposes such as monitoring vehicle sensor behavior. The
pipeline comprises Apache Kafka for persistent event storage, Apache Flink
for continuous stateful analysis, and Apache Druid for data serving. Evaluation
of the pipeline is performed from the perspective of a set of metrics, namely data
completeness, sustainable throughput, latency, scalability and fault tolerance. In
order to harmonize the requirements of the pipeline and discern how trade-offs affect
performance, various tool-tuning experiments and stress tests are performed.
Performance evaluation of the pipeline reveals that in a controlled environment,
with limited resources, the minimum throughput requirement of the use case can
be sustained, while still achieving sub-second latencies and offering a degree of fault
tolerance. The pipeline also shows promise of adapting well to different levels of
scale, providing enough headroom for a tenfold increase in data volumes over current
demands.
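To make the kind of stateful, per-key validation described above concrete, the following is a minimal Python sketch of two checks the abstract mentions: data completeness (detecting gaps in a sensor's event sequence) and value validation. The record fields, sequence-number scheme, and range bounds are illustrative assumptions, not the thesis's actual schema; in the real pipeline such keyed state would live inside a Flink job consuming from Kafka, not a plain dictionary.

```python
from dataclasses import dataclass

@dataclass
class Reading:
    vehicle_id: str   # key for per-vehicle state (analogous to Flink's keyBy)
    seq: int          # assumed monotonically increasing sequence number
    value: float      # sensor measurement

def validate_stream(readings, lo=-40.0, hi=120.0):
    """Stateful validation over an unbounded iterator of readings.

    Tracks the last sequence number seen per vehicle to detect gaps
    (a completeness check) and flags out-of-range values.
    """
    last_seq = {}  # per-key state: vehicle_id -> last sequence number seen
    for r in readings:
        issues = []
        prev = last_seq.get(r.vehicle_id)
        if prev is not None and r.seq != prev + 1:
            issues.append(f"gap: expected seq {prev + 1}, got {r.seq}")
        if not (lo <= r.value <= hi):
            issues.append(f"out of range: {r.value}")
        last_seq[r.vehicle_id] = r.seq
        yield r, issues

# Example: the second reading triggers both a gap and a range issue.
events = [Reading("truck-1", 1, 21.5), Reading("truck-1", 3, 150.0)]
for reading, issues in validate_stream(events):
    print(reading.seq, issues)
```

Because the function is a generator consuming an iterator, it processes records one at a time as they arrive, mirroring how a streaming operator handles an unbounded flow rather than a finished batch.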
Subject/keywords
Data pipelines, stream processing, data completeness, latency & throughput, fault tolerance, scalability, data validation