Trading performance for precision in a CRDT-based rate-limiting system
Publicerad
Författare
Typ
Examensarbete för masterexamen
Program
Modellbyggare
Tidskriftstitel
ISSN
Volymtitel
Utgivare
Sammanfattning
The past two decades have seen a rise in popularity of distributed cloud services,
where some service is replicated and thus made available from several nodes to
handle high loads of traffic and maintain operations at low latency. This rise in
popularity has increased the need for solutions that enable fair distribution and
limitation of these services’ resources among its users, one of which is referred to
as rate limiting. One of the rate-limiting solutions at Spotify, the company at
which this thesis was conducted, employs optimistic replication and uses replicated
counters to implement a token-bucket algorithm. The counters track the number of
approved requests for users, rate limit the user based on this number, and operate in
a conflict-free manner. The protocol for synchronizing nodes makes use of gossiping
to lazily propagate synchronization messages to keep the costs of communication
low. This solution strongly ties the system’s ability to accurately rate limit users to
the convergence rate between the replicated states.
A shortcoming of this solution is that users can exceed the limit by sending many
requests to multiple nodes within a short period. To address this shortcoming, this
thesis introduces and evaluates an alternative rate-limiting solution that enables the
system to monitor the frequencies of users’ requests continuously. These frequencies
are then used to synchronize data for users that could reach their limit sooner than
their states synchronize with gossiping. To determine the frequencies between users’
requests, the requests’ time of approval are stored as elements in a replicated queue.
To uphold the token-bucket algorithm using this queue, each element additionally
holds a timestamp of when the element should be removed. While storing the approval
time of requests in a replicated queue caused the nodes to require more time
to converge, our experiments have shown that it does not influence the number of
approved requests, i.e., precision, in most cases. However, the precision could be
impacted when dealing with requests sent at a specific rate. To prevent degradation
of the precision, the system’s parameters, for example, the rate at which nodes
synchronize and the number of nodes, have to be set with care. The solution was
shown to allow the rate-limiting system to enjoy the benefits of lazy synchronization
for users who should not be rate limited. The rate-limiting system was simultaneously
able to synchronize faster for misbehaving users, resulting in higher precision,
at the cost of temporary increases to communication costs. Finally, the memory
consumption was reduced to 13.6 % of the previous solution’s memory usage when
dealing with traffic representative of Spotify’s system.
Beskrivning
Ämne/nyckelord
rate limiting, gossip, precision, frequency, replicated queue, continuous distributed monitoring, delta synchronization