Interactive Fine-Grained Provenance for Streaming-based Analysis Applications
Master's thesis
Gordani Shahri, Mikael
Streaming-based applications that process unbounded continuous streams of data, such as user activity on the web or sensor data, can be designed to detect critical events. When such an event occurs, an application can benefit from retaining the associated source data for further analysis. This can be achieved with fine-grained data provenance, which links each event back to the source data contributing to it. This thesis focuses on GeneaLog, a state-of-the-art data provenance technique that collects fine-grained provenance data for cyber-physical systems and maintains it with low overhead. Generating provenance can be a heavy operation in certain applications, where the resulting overhead is not always negligible. Adjusting GeneaLog so that it becomes operational only when a critical event occurs, as opposed to always being operational, can be beneficial because it reduces unnecessary provenance generation. The goal is to extend GeneaLog to generate provenance information interactively and to evaluate under what conditions such an extension becomes beneficial. Because this might reduce processing and memory overhead, GeneaLog, and consequently data provenance techniques in general, could be introduced to a wider range of devices and applications. This thesis proposes an extension to GeneaLog called Twins. To activate and deactivate GeneaLog, Twins introduces a system consisting of two queries and a pair of special operators. The first query is equipped with standard operators and the second query with operators that generate provenance information. Initially, the first query processes tuples until a critical event is produced, which initiates a transition to the second query. If no critical events occur after a transition, the system transitions back to the first query. The transitions are handled by the special operators, called Ward operators, which are responsible for triggering and performing a transition between the queries.
This thesis uses and extends a prototype of GeneaLog built for the stream processing engine Apache Flink. During the evaluation, the observed throughput of Twins resembled that of GeneaLog when provenance was active and that of a baseline query with no provenance generation when provenance was inactive. The preliminary results indicate that Twins can be beneficial in scenarios where generating provenance is not a negligible operation in terms of overhead.
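The switching behaviour described in the abstract can be illustrated with a small, self-contained simulation. This is only a sketch under stated assumptions, not the Twins implementation and not Apache Flink code: the names `run_twins`, `Result`, `is_critical`, and `quiet_limit` are hypothetical, both "queries" are reduced to an identity operator, and the rule for transitioning back (a fixed number of consecutive non-critical tuples) is an assumption, since the abstract does not state how an absence of critical events is measured.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass
class Result:
    value: float
    mode: str                               # "plain" or "provenance"
    sources: Optional[Tuple[float, ...]]    # contributing source tuples, if tracked


def run_twins(
    stream: Iterable[float],
    is_critical: Callable[[float], bool],
    quiet_limit: int = 3,                   # assumed switch-back rule, not from the thesis
) -> List[Result]:
    """Toy model of two twin queries with a switch between them."""
    mode = "plain"      # start in the query with standard operators
    quiet = 0           # consecutive non-critical tuples seen in provenance mode
    out: List[Result] = []

    for x in stream:
        if mode == "plain":
            # Standard query: no provenance is recorded.
            out.append(Result(x, "plain", None))
            if is_critical(x):
                # A critical event triggers the transition to the provenance query.
                mode, quiet = "provenance", 0
        else:
            # Provenance query: each result carries its source tuples.
            out.append(Result(x, "provenance", (x,)))
            if is_critical(x):
                quiet = 0
            else:
                quiet += 1
                if quiet >= quiet_limit:
                    # No critical events for a while: transition back.
                    mode = "plain"
    return out


if __name__ == "__main__":
    results = run_twins([1, 9, 2, 9, 1, 1, 1, 9], lambda v: v > 5, quiet_limit=2)
    for r in results:
        print(r)
```

In this sketch the tuple that triggers a transition is still processed by the plain path, so it carries no provenance of its own. How the actual system treats the triggering tuples is not stated in the abstract, so that detail should not be read into the example.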
data analytics, Apache Flink, streaming, data provenance