Interactive Fine-Grained Provenance for Streaming-based Analysis Applications
Type
Master's thesis
Published
2021
Authors
Erlandsson, Andréas
Gordani Shahri, Mikael
Abstract
Streaming-based applications that process unbounded continuous streams of data,
such as user activity on the web or sensor data, can be designed to detect critical
events. When such an event occurs, an application can benefit from retaining the
associated source data for further analysis. This can be achieved with fine-grained
data provenance, which links each event back to the source data contributing to it.
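
To make the idea of linking an event back to its source data concrete, the following is a minimal sketch, not GeneaLog's actual mechanism. The `Record` type, the tumbling-window average, and the threshold are all hypothetical choices made for illustration; each record simply carries the ids of the source tuples that contributed to it.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Record:
    """A stream tuple plus the ids of the source tuples that contributed to it."""
    value: float
    sources: FrozenSet[int]


def source(tuple_id: int, value: float) -> Record:
    # A source tuple is its own provenance.
    return Record(value, frozenset({tuple_id}))


def window_average(window: Iterable[Record]) -> Record:
    # The aggregate's provenance is the union of its inputs' provenance.
    items = list(window)
    merged = frozenset().union(*(r.sources for r in items))
    return Record(sum(r.value for r in items) / len(items), merged)


def detect_critical(records: List[Record], size: int, threshold: float) -> List[Record]:
    # Tumbling windows of `size` tuples; emit an event when the average exceeds `threshold`.
    events = []
    for start in range(0, len(records) - size + 1, size):
        avg = window_average(records[start:start + size])
        if avg.value > threshold:
            events.append(avg)
    return events


# Hypothetical sensor readings; ids are their positions in the stream.
readings = [source(i, v) for i, v in enumerate([1.0, 2.0, 9.0, 11.0, 3.0, 2.0])]
events = detect_critical(readings, size=2, threshold=5.0)
```

Here the single emitted event carries the ids of the two readings in its window, so the contributing source data can be looked up afterwards. Carrying id sets on every tuple is the naive approach and is chosen only for readability.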
This thesis focuses on GeneaLog, a state-of-the-art data provenance technique
that collects fine-grained provenance data for cyber-physical systems and
maintains it with low overhead. Generating provenance can nevertheless be a heavy
operation in certain applications, where the overhead is not always negligible.
Making GeneaLog operational only when a critical event occurs, as opposed to
always being operational, can be beneficial because it reduces unnecessary
provenance generation.
The goal is to extend GeneaLog to generate provenance information interactively and
to evaluate under what conditions such an extension becomes beneficial. Because
such an extension might reduce processing and memory overhead, GeneaLog, and
consequently data provenance techniques in general, could be introduced to a
wider range of devices and applications.
In this thesis, an extension for GeneaLog called Twins is proposed. To activate
and deactivate GeneaLog, Twins introduces a system consisting of two queries and
a pair of special operators. The first query is equipped with standard operators
and the second query with operators that generate provenance information.
Initially, the first query processes tuples until a critical event is produced,
which initiates a transition to the second query. If no critical events occur
after a transition, the system transitions back to the first query. The
transitions are handled by the special operators, called Ward operators, which
are responsible for triggering and performing a transition between the queries.
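
The switching behaviour described above can be sketched as a small simulation. This is an illustrative toy under stated assumptions, not the Twins implementation: the `Ward` class, the `quiet_limit` rule for deciding that critical events are absent, and the two one-line "queries" are hypothetical, and no Apache Flink API is used.

```python
from typing import Callable, List, Optional, Tuple

Output = Tuple[float, Optional[List[float]]]


def plain_query(x: float) -> Output:
    # Standard operators: compute the result only.
    return (x * 2, None)


def provenance_query(x: float) -> Output:
    # Provenance-generating operators: also record the contributing source data.
    return (x * 2, [x])


class Ward:
    """Hypothetical stand-in for the Ward operators: routes tuples and switches queries."""

    def __init__(self, is_critical: Callable[[float], bool], quiet_limit: int) -> None:
        self.is_critical = is_critical
        # Assumed rule: switch back after `quiet_limit` consecutive non-critical results.
        self.quiet_limit = quiet_limit
        self.provenance_active = False
        self.quiet_count = 0

    def process(self, x: float) -> Output:
        query = provenance_query if self.provenance_active else plain_query
        result = query(x)
        if self.is_critical(result[0]):
            # A critical event activates (or keeps active) the provenance query.
            self.provenance_active = True
            self.quiet_count = 0
        elif self.provenance_active:
            self.quiet_count += 1
            if self.quiet_count >= self.quiet_limit:
                # No critical events for a while: return to the plain query.
                self.provenance_active = False
        return result


ward = Ward(is_critical=lambda y: y > 10, quiet_limit=2)
outputs = [ward.process(x) for x in [1.0, 6.0, 7.0, 1.0, 2.0, 3.0]]
```

In this toy version the result that triggers the switch is itself produced without provenance, and only subsequent results carry it. How Twins treats the triggering event, and how it decides that critical events are absent, is not stated in this abstract.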
A prototype of GeneaLog built for the stream processing engine Apache Flink was
used and extended in this thesis. During the evaluation, the observed
throughput of Twins resembled that of GeneaLog when provenance was active and
that of a baseline query with no provenance generation when provenance was inactive.
The preliminary results indicate that Twins can be beneficial in scenarios
where the overhead of generating provenance is not negligible.
Subject/keywords
data analytics, Apache Flink, streaming, data provenance