Low-Latency Anomaly Detection using Stream Processing

Type

Master's thesis

Abstract

To ensure the continuous operation of online services, it is important to detect system failures quickly. This can be done by monitoring metrics, such as the number of logins or errors per hour, for unexpected behaviour. Such unexpected behaviour, known as an anomaly, can indicate that something in the system is not working as intended, which makes it important to detect anomalies with low latency. In this thesis, we researched how anomalies can be detected in metrics with low latency using stream processing, a data processing paradigm in which data is processed as continuous streams of events. This thesis was conducted at Spotify, one of the largest audio streaming platforms in the world. To research low-latency anomaly detection using stream processing, we implemented Harpooner, a stream-processing-based counterpart to an existing batch-processing-based anomaly detection system at Spotify. Harpooner analyses metrics in segments, which are subsets of users, and detects anomalies on an hourly basis. Anomalies are detected using the Kolmogorov-Smirnov (K-S) test, a statistical test that can be used to determine whether two samples are drawn from the same underlying distribution. Harpooner was implemented using Apache Beam, a programming model for expressing stream processing pipelines. It was implemented in several versions, which weigh trade-offs between implementation simplicity, data storage and the computational complexity of the K-S test. Harpooner consists of two parts: a metric calculation part, which is identical in all versions, and an anomaly detection part, which differs between versions. These parts were evaluated separately, using data from Spotify to ensure semantic equivalence between Harpooner and the existing system, and synthetic data to measure their scalability.
During evaluation, it was shown that the most efficient anomaly detection part was able to detect anomalies in a metric with 6,000 segments with a latency below 10 seconds when run on a single node on Cloud Dataflow, and that in a real setting the metric calculation part would be the bottleneck of the pipeline. However, if the two parts were deployed as two separate pipelines, our preliminary results indicate that Harpooner would be able to scale to handle the load necessary to perform anomaly detection on metrics at Spotify.
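
To illustrate the kind of test the abstract refers to, the following is a minimal sketch of a two-sample K-S test in plain Python. It is not Harpooner's implementation: the function names `ks_statistic` and `is_anomalous`, the significance level `alpha=0.05`, and the use of the large-sample critical value are all assumptions made for illustration.

```python
import math
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample K-S statistic: the largest gap between the two empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    # The gap between two step functions can only change at observed values,
    # so it is enough to evaluate both empirical CDFs at every data point.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def is_anomalous(reference, current, alpha=0.05):
    """Flag `current` when the K-S test rejects 'same distribution' at level alpha.

    Uses the standard large-sample approximation for the critical value;
    alpha=0.05 is an illustrative default, not a value taken from the thesis.
    """
    d = ks_statistic(reference, current)
    n, m = len(reference), len(current)
    critical = math.sqrt(-math.log(alpha / 2) / 2) * math.sqrt((n + m) / (n * m))
    return d > critical
```

In this sketch, `reference` could hold a segment's historical metric values and `current` the values for the most recent hour; how the two samples are actually constructed in Harpooner is not specified in the abstract. The loop evaluates both empirical CDFs at every observation, which is simple but not the cheapest approach; the abstract notes that the thesis compares versions with different computational complexity for the K-S test.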

Subject/keywords

Stream processing, Anomaly detection, Apache Beam, Streaming systems, Kolmogorov-Smirnov test
