Interactive Fine-Grained Provenance for Streaming-based Analysis Applications

Erlandsson, Andréas; Gordani Shahri, Mikael

Download

CSE 21-09 Erlandsson Gordani Shahri.pdf (5.22 MB)

Published

2021

Authors

Erlandsson, Andréas

Gordani Shahri, Mikael

Type

Master's thesis

Abstract

Streaming-based applications that process unbounded, continuous streams of data, such as user activity on the web or sensor data, can be designed to detect critical events. When such an event occurs, an application can benefit from retaining the associated source data for further analysis. This can be achieved with fine-grained data provenance, which links each event back to the source data that contributed to it. This thesis focuses on GeneaLog, a state-of-the-art data provenance technique that collects fine-grained provenance data for cyber-physical systems and maintains it with low overhead. Generating provenance can nevertheless be a heavy operation in certain applications, where the overhead is not always negligible. Adjusting GeneaLog so that it becomes operational only when a critical event occurs, rather than always being operational, can reduce unnecessary provenance generation. The goal is to extend GeneaLog to generate provenance information interactively and to evaluate under what conditions such an extension is beneficial. Because this might reduce processing and memory overhead, GeneaLog, and consequently data provenance techniques in general, could be introduced to a wider range of devices and applications. This thesis proposes an extension of GeneaLog called Twins. To activate and deactivate GeneaLog, Twins introduces a system consisting of two queries and a pair of special operators. The first query is equipped with standard operators, and the second query with operators that generate provenance information. Initially, the first query processes tuples until a critical event is produced, which initiates a transition to the second query. If no critical events occur after a transition, a transition is made back to the first query. Transitions are triggered and performed by the special operators, called the Ward operators.
The thesis uses and extends a prototype of GeneaLog built for the stream processing engine Apache Flink. During the evaluation, the observed throughput of Twins resembled that of GeneaLog when provenance was active, and that of a baseline query without provenance generation when provenance was inactive. The preliminary results indicate that Twins can be beneficial in scenarios where generating provenance is not negligible in terms of overhead.
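
The switching behaviour described in the abstract can be illustrated with a small, self-contained Python sketch. It is not the thesis implementation and does not use Apache Flink or GeneaLog: the function `run_twins`, the tumbling-window sum, the `threshold` test for a critical event, and the `quiet_limit` rule for switching back are all hypothetical choices made for illustration. As a further simplification, the event that triggers the switch is emitted without provenance, since it is produced while the plain query is still active.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Event:
    value: int                        # aggregate that crossed the threshold
    provenance: Optional[List[int]]   # contributing source tuples, or None


def run_twins(stream: Iterable[int], threshold: int,
              window: int = 3, quiet_limit: int = 2) -> List[Event]:
    """Toy model of two twin queries with a ward-like switch (hypothetical).

    Both queries sum tumbling windows of `window` tuples and report a
    critical event when the sum exceeds `threshold`. Only the provenance
    query attaches the source tuples to the event.
    """
    provenance_on = False   # which of the two queries is currently active
    quiet = 0               # windows without a critical event since the switch
    buffer: List[int] = []
    events: List[Event] = []

    for item in stream:
        buffer.append(item)
        if len(buffer) < window:
            continue

        total = sum(buffer)
        critical = total > threshold
        if critical:
            sources = list(buffer) if provenance_on else None
            events.append(Event(total, sources))

        # Ward-like logic: switch on at a critical event, switch off
        # after `quiet_limit` consecutive windows without one.
        if critical:
            provenance_on = True
            quiet = 0
        elif provenance_on:
            quiet += 1
            if quiet >= quiet_limit:
                provenance_on = False
                quiet = 0

        buffer = []

    return events


if __name__ == "__main__":
    data = [1, 1, 1, 5, 5, 5, 6, 6, 6, 1, 1, 1, 1, 1, 1, 7, 7, 7]
    for event in run_twins(data, threshold=10):
        print(event)
```

In this toy run, only the critical event that occurs while the provenance query is active carries its source tuples; events that trigger a switch, including the one after the quiet period, do not. How the actual Ward operators handle the triggering event and in-flight state is not stated in the abstract.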

Subject/keywords

data analytics, apache flink, streaming, data provenance

URI

https://hdl.handle.net/20.500.12380/302287

Collections

Master's theses
