EPGTOP: A tool for continuous monitoring of a distributed system
Master's thesis
Computer systems and networks (MPCSN), MSc
Monitoring is fundamental to providing operational support for online systems and has been an integral part of most computer systems for decades. Kernel-level counters that maintain statistics such as the number of accepted or dropped packets are examples of classic monitoring. The ever-increasing number of connected devices has increased the scale of computer systems. Distributed systems are now inherent to most large-scale computer systems and require adjustments to existing monitoring algorithms, since monitoring statistics are no longer retained locally and must be communicated over a network. Evolved Packet Gateway (EPG) is a performance-critical distributed system responsible for processing mobile broadband data. An EPG system contains cards that process requests and can scale up to thousands of worker processes when running in production. The amount of data generated and transmitted to monitor these processes using traditional methods can overload the network that the cards in an EPG use to communicate with one another and adversely affect the system's performance. This thesis provides an overview of continuous distributed monitoring and evaluates continuous monitoring algorithms for distributed systems. The thesis presents EPGTOP, a monitoring service developed for continuous monitoring of EPG to assess the communication efficiency of monitoring algorithms. EPGTOP provides two modes of operation: basic and approximate. In the basic mode, monitoring data is periodically transmitted to a designated management node. To improve the communication efficiency of monitoring, the approximate mode allows an error threshold to be configured. The threshold is used to adjust the accuracy of the system statistics continuously reported by the management node.
Furthermore, the thesis discusses the adjustments that monitoring algorithms require before they can be integrated into EPG, and presents results from EPGTOP that compare and analyse the trade-offs between accuracy and communication efficiency when continuously monitoring distributed systems. Our results demonstrate that continuous distributed monitoring algorithms can significantly improve the efficiency of monitoring by reducing communication costs. Additionally, larger error thresholds lead to far less monitoring data being generated, albeit at the expense of accuracy.
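The difference between the two modes can be illustrated with a minimal sketch. It is not taken from EPGTOP: the function `simulate`, its parameters, and the relative-error rule it uses are assumptions chosen only to show the general idea of threshold-based reporting. A single hypothetical worker holds one counter and, in the approximate case, reports it only when the value last sent to the management node has drifted by more than a fraction `epsilon` of the current value.

```python
def simulate(increments, epsilon=None):
    """Simulate one hypothetical worker reporting a counter to a management node.

    increments: per-period increases of the local counter (illustrative input).
    epsilon:    assumed relative error threshold; None models the basic mode,
                in which the worker reports in every period.
    Returns (messages_sent, max_relative_error_seen_by_the_management_node).
    """
    value = 0        # true local counter
    last_sent = 0    # value most recently reported to the management node
    messages = 0     # number of updates transmitted
    max_error = 0.0  # worst relative error of the reported value

    for inc in increments:
        value += inc
        # Basic mode: always report. Approximate mode: report only when the
        # last reported value deviates by more than epsilon * value.
        if epsilon is None or abs(value - last_sent) > epsilon * value:
            last_sent = value
            messages += 1
        if value > 0:
            max_error = max(max_error, abs(value - last_sent) / value)

    return messages, max_error


# Compare the two modes on the same illustrative workload.
workload = [10] * 100
basic_messages, basic_error = simulate(workload)
approx_messages, approx_error = simulate(workload, epsilon=0.1)
print(basic_messages, basic_error)
print(approx_messages, approx_error)
```

In this sketch a larger `epsilon` lets the worker stay silent for longer, so fewer messages are sent, while the value held by the management node is allowed to deviate further from the true counter. That is the same accuracy versus communication trade-off the abstract describes, although the algorithms evaluated in the thesis may differ in detail.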
Computer and Information Science