An exploratory study of trade-offs in traditional vs. serverless stream processing

Type

Master's Thesis

Abstract

A stream is the natural form of data that is perpetually being generated, and stream processing is a way to draw valuable insights from it. With the rapid increase in data volumes, driven primarily by IoT devices, stream processing has emerged as a practical approach to data processing. Characteristics such as data volume and distribution can vary over time, which changes the computational requirements of streaming applications. Adjusting the underlying framework to these changing requirements calls for elasticity. Because the traditional frameworks commonly used to run stream processing applications, known as Stream Processing Engines (SPEs), are not flexible enough, there is often some degree of over-provisioning: the allocated resources exceed what is required and remain unutilized. Alternative approaches, such as serverless, can ease scalability, but they have both pros and cons, which this work examines. This work implements an SPE-like API for a serverless framework and uses it to explore the differences between the traditional and serverless models of stream processing, using Apache Flink and Apache OpenWhisk. The study shows that OpenWhisk can be used to implement and execute streaming applications similar to those run by Flink; with the application logic implemented correctly, behavior similar to Flink's can be achieved in OpenWhisk. The serverless nature of OpenWhisk, with its pay-per-use pricing model, allows for reduced costs when the framework is idle. Performance was evaluated with a stateless application (one that does not require state to be preserved across executions) using the map() API, and with a stateful application (one that does require state to be preserved across executions) using the windowAll() API with a sum aggregate.
The findings indicate a latency increase of 300-400% in the most intensive test cases and throughput lowered to 50% for OpenWhisk compared to Flink. These results show that Flink offers greater capacity and performance than OpenWhisk for comparable workloads. Flink's extensive resource base, including APIs and support resources, makes application development easier and positions it as a robust, well-established solution. OpenWhisk, on the other hand, is best suited to projects that do not require rich stream processing libraries or explicit state management. Its high-level scalability abstraction, which utilizes Kubernetes, simplifies scaling operations. Both frameworks can be configured to behave similarly, with benefits and trade-offs that depend on the individual use case.
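
The difference between the two evaluated application types can be illustrated with a minimal plain-Python sketch. It uses neither Flink nor OpenWhisk: the function names `stateless_map` and `count_window_sum_all`, the count-based window, and the window size of 3 are assumptions made purely for illustration and are not taken from the thesis.

```python
from typing import Callable, Iterable, Iterator


def stateless_map(events: Iterable[int], fn: Callable[[int], int]) -> Iterator[int]:
    """Apply fn to each event independently; nothing is kept between events."""
    for event in events:
        yield fn(event)


def count_window_sum_all(events: Iterable[int], size: int) -> Iterator[int]:
    """Sum events in non-overlapping windows of `size` elements.

    The running total and the element count are state that must survive
    from one event to the next, which is what makes this operator stateful.
    """
    total = 0
    count = 0
    for event in events:
        total += event
        count += 1
        if count == size:
            yield total
            total, count = 0, 0
    # A trailing partial window is dropped here (an illustrative choice).


stream = [1, 2, 3, 4, 5, 6, 7]
mapped = list(stateless_map(stream, lambda x: x * 2))
windowed = list(count_window_sum_all(stream, size=3))
```

In this sketch `total` and `count` live in the memory of a single process. A serverless invocation cannot in general rely on memory from a previous invocation, so equivalent state would have to be preserved elsewhere between executions; how the thesis handles this is not stated in the abstract.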

Subject/keywords

Stream, data, serverless, Flink, OpenWhisk, latency, throughput
