Trading performance for precision in a CRDT-based rate-limiting system

Typ
Examensarbete för masterexamen
Program
Publicerad
2021
Författare
HENRIKSSON, ANDREAS
BENNHAGE, SVANTE
Modellbyggare
Tidskriftstitel
ISSN
Volymtitel
Utgivare
Sammanfattning
The past two decades have seen a rise in popularity of distributed cloud services, where some service is replicated and thus made available from several nodes to handle high loads of traffic and maintain operations at low latency. This rise in popularity has increased the need for solutions that enable fair distribution and limitation of these services’ resources among its users, one of which is referred to as rate limiting. One of the rate-limiting solutions at Spotify, the company at which this thesis was conducted, employs optimistic replication and uses replicated counters to implement a token-bucket algorithm. The counters track the number of approved requests for users, rate limit the user based on this number, and operate in a conflict-free manner. The protocol for synchronizing nodes makes use of gossiping to lazily propagate synchronization messages to keep the costs of communication low. This solution strongly ties the system’s ability to accurately rate limit users to the convergence rate between the replicated states. A shortcoming of this solution is that users can exceed the limit by sending many requests to multiple nodes within a short period. To address this shortcoming, this thesis introduces and evaluates an alternative rate-limiting solution that enables the system to monitor the frequencies of users’ requests continuously. These frequencies are then used to synchronize data for users that could reach their limit sooner than their states synchronize with gossiping. To determine the frequencies between users’ requests, the requests’ time of approval are stored as elements in a replicated queue. To uphold the token-bucket algorithm using this queue, each element additionally holds a timestamp of when the element should be removed. While storing the approval time of requests in a replicated queue caused the nodes to require more time to converge, our experiments have shown that it does not influence the number of approved requests, i.e., precision, in most cases. However, the precision could be impacted when dealing with requests sent at a specific rate. To prevent degradation of the precision, the system’s parameters, for example, the rate at which nodes synchronize and the number of nodes, have to be set with care. The solution was shown to allow the rate-limiting system to enjoy the benefits of lazy synchronization for users who should not be rate limited. The rate-limiting system was simultaneously able to synchronize faster for misbehaving users, resulting in higher precision, at the cost of temporary increases to communication costs. Finally, the memory consumption was reduced to 13.6 % of the previous solution’s memory usage when dealing with traffic representative of Spotify’s system.
Beskrivning
Ämne/nyckelord
rate limiting , gossip , precision , frequency , replicated queue , continuous distributed monitoring , delta synchronization
Citation
Arkitekt (konstruktör)
Geografisk plats
Byggnad (typ)
Byggår
Modelltyp
Skala
Teknik / material
Index