Low-Latency Anomaly Detection using Stream Processing
Published
Author
Type
Master's thesis
Programme
Model builders
Journal title
ISSN
Volume title
Publisher
Abstract
To ensure the continuous operation of online services, it is important to be able to
quickly detect system failures. This can be done by monitoring metrics, such as the
number of logins or errors per hour, for unexpected behaviour. These unexpected
behaviours, also known as anomalies, can indicate that something in the system is
not working as intended, which makes it important to be able to detect them with
low latency. In this thesis, we researched how anomalies can be detected in metrics
with low latency using stream processing, a data processing paradigm in which data
is processed as continuous streams of events. This thesis was conducted at Spotify,
one of the largest audio streaming platforms in the world.
To research low-latency anomaly detection using stream processing, we implemented
Harpooner, a stream-processing-based counterpart to an existing batch-processing-based
anomaly detection system at Spotify. Harpooner analyses metrics in segments,
which are subsets of users, and detects anomalies on an hourly basis. Anomalies
are detected using the Kolmogorov-Smirnov (K-S) test, a statistical test that can
be used to determine if two samples are drawn from the same underlying distribution.
Harpooner was implemented using Apache Beam, a programming model
for expressing stream processing pipelines. It was implemented in various versions
which weighed trade-offs between implementation simplicity, data storage and computational
complexity of the K-S test.
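
To make the two-sample K-S test mentioned above concrete, here is a minimal sketch in plain Python. It is not Harpooner's implementation, and it does not use Apache Beam. The function names and the 0.05 significance level are assumptions chosen for illustration, and the rejection threshold uses the standard large-sample approximation.

```python
from bisect import bisect_right
import math


def ks_statistic(sample_a, sample_b):
    """Two-sample K-S statistic: the largest gap between the two empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    # The empirical CDFs only change at observed values, so checking those suffices.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def ks_rejects(sample_a, sample_b, alpha=0.05):
    """Reject 'same distribution' if D exceeds the large-sample critical value.

    alpha is an illustrative default, not a value taken from the thesis.
    """
    n, m = len(sample_a), len(sample_b)
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))
    threshold = c_alpha * math.sqrt((n + m) / (n * m))
    return ks_statistic(sample_a, sample_b) > threshold
```

In an anomaly detection setting, one sample could hold recent values of a metric and the other a historical reference. A rejection would then flag the recent values as anomalous.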
Harpooner consists of two parts: a metric calculation part, which is identical in all
versions, and an anomaly detection part, which differs between versions. These
parts were evaluated separately using data from Spotify to ensure semantic equivalence
between Harpooner and the existing system, and synthetic data to measure
their scalability. The evaluation showed that the most efficient anomaly
detection part was able to detect anomalies in a metric with 6,000 segments with a
latency below 10 seconds when run on a single node on Cloud Dataflow, and that
in a real setting the metric calculation part would be the bottleneck of the pipeline.
However, if the two parts were deployed as two separate pipelines, our preliminary
results indicate that Harpooner would be able to scale to handle the load necessary
to do anomaly detection in metrics at Spotify.
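
As a rough illustration of the metric calculation part described above, the sketch below counts events per segment in fixed one-hour event-time windows. It is plain Python rather than an Apache Beam pipeline. The event layout (a timestamp in seconds paired with a segment name) and the function name are assumptions for illustration only. Concerns a real streaming system must handle, such as late data and watermarks, are ignored.

```python
from collections import Counter

SECONDS_PER_HOUR = 3600


def hourly_segment_counts(events):
    """Count events per (segment, window start) using fixed one-hour windows.

    `events` is an iterable of (timestamp_seconds, segment) pairs;
    this layout is hypothetical and not taken from the thesis.
    """
    counts = Counter()
    for timestamp, segment in events:
        # Align the timestamp to the start of its one-hour window.
        window_start = timestamp - timestamp % SECONDS_PER_HOUR
        counts[(segment, window_start)] += 1
    return counts
```

Each resulting per-segment hourly count is the kind of value that an anomaly detection step could then compare against historical values for the same segment.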
Description
Subject/keywords
Stream processing, Anomaly detection, Apache Beam, Streaming systems, Kolmogorov-Smirnov test