Leaky bucket algorithm implemented on Redis and PHP

One of the biggest problems of web development are programmers not fully aware of the infrastructure they work with. An average PHP developer will not know about the limitations the infrastructure has, the maximum number of connections the web site can handle, etc. The PHP developer just programs on PHP, refreshes the page, runs the unit tests, and if everything is OK, considers his work as finished. However once the application is deployed, everything starts to depend on the actual infrastructure.

In many cases the pages will not be ready to handle an excessive amount of traffic; and will noticeably degrade service quality under heavy -and not so heavy- loads. For example with hihgly used APIs by external developers -who could not be aware of what caching means-, or when the site is attacked by scrapers who want to populate their site with your data, or a botnet.

This is a recurrent problem we have at my office; thus I’ve proposed to start using the well known leaky bucket system to protect our sites from scrapers. My manager liked the initiative, but he wanted to apply it at the infrastructure level and not in the application level. Even when I strongly agree with that as a definitive solution, I think there’s no reason to implement the whole thing at the infrastructure level without knowing how bad is the current situation. What would we say? “Block all IPs making more than X amount of request per minute?” Yes, that’s the idea,but what would be that X?

What I wanted to do, instead, is to apply the leaky bucket at the application level for testing purposes. That will not take too much time, and by logging how many requests and different clients would have been blocked we can get some interesting information to make the definitive implementation at the infrastructure level. Also that would allow us not only to log who will be blocked, but to put some Analytics Event tracking codes. In that way we would see in Analytics how many real users would have been blocked with the specified settings, allowing us to tune it up. Besides the server-side logging, we want to know also which percentage of that are real browsers and not scrapers.

That is how I came up with these small PHP files that make the whole implementation for testing purposes.

The code is splitted in two parts: the request script and the clean-up script. Request is basically the doorman, “yes, come in” or “no, get the f*ck out of here”. The clean-up script is who reviews the list every often and takes some drops out of the bucket. The whole thing uses the Flexihash script for consistent hashing and so splitting the data across many sets -as much as you need. The example is fully dynamic, but you can hardcode the set names to make it faster.

Said that, please get the FlexiHash files from http://github.com/pda/flexihash and set the include line to the proper place.

Then, let’s start with the initial inc.php file which should be included in both request.php and cleanup.php:

include 'flexihash-0.1.9.php';

// Redis extension from nicolasff, thanks bro!
$redis = new Redis();

$hasher = new Flexihash();

$numberOfSets = 10;

for($i = 1; $i <=$numberOfSets; $i++){
	$pad = str_pad($i,strlen($numberOfSets),'0',STR_PAD_LEFT);
	$sets[] = 'set'.$pad; // To create sets like set01, set02;
	// or set0001, set0999 if $numberOfSets is 1000


This part is easy, right? We prepared the Redis connection and the hasher.

Now, let’s see request.php


if(sizeof($argv) < 3){
	die('2 parameters: clientId, actionId; ie A search');

$id = $argv[1];
$action = $argv[2];

$id = $id.'-'.$action;

$set = $hasher->lookup($id);

$period = 30; // In seconds
$limit = 6; // How many hits allowed every $period seconds

$actualHits = $redis->zscore($set, $id);

if($actualHits >= $limit){
	echo "Not allowed.  {$actualHits} hits done, only {$limit} are allowed every {$period} seconds\n";

	// Log that this request would have been locked.
	$redis->zIncrBy('locked', 1, $id);

list($actualHits) = $redis->multi()->zIncrBy($set, 1, $id)->zAdd($set.':control', -1, $id)->exec();
$available = $limit - $actualHits;
echo "Approved, you have {$available} hits\n";

In short, the script will see if the requested client (in most cases will be an IP check) has enough shots to make the requested action. If it does not have shots, we log the action for statistical purposes and fine tuning. If he does have shots, we increase it’s number of requests by 1, we add a -1 to the control set -to be used later on- and we let the client know how many hits are remaining.

Now, let’s see the cleanup script; that should be executed periodically in a cronjob. We’ll go back to that subject later on, no worries.


// Loop across all sets
foreach($sets as $set){
	// Remove all entries with score =< 1 from all sets. This will
	// reduce the size before furthing processing
	echo "Set has ".$redis->zCard($set)." elements, before cleanup.";
	$redis->zRemRangeByScore($set, '-inf', '1');
	echo " Now it has ".$redis->zCard($set).".\n";
	// Remember the control set we created on request.php? That sorted set contains all entries
	// the set has, but with score as -1. The goal of that zset is to reduce by 1 all scores
	// storing user hits, by intersecting set and set:control.
	echo "Control set had ".$redis->zcard($set.':control') . ' before cleanup, now it has ';
	$redis->zinterstore($set.':control', array($set.':control', $set), array(1,0), 'SUM');
	echo $redis->zcard($set.':control') ."\n";
	// Now do the interstore on set by substracting one
	// Remember in the request.php file we add the client to set with score -1?
	// That's to use it with a zInterStore and thus substracting 1 from all
	// scores. The trick here is the aggregation function SUM instead of WEIGHT.
	$redis->zinterstore($set, array($set, $set.':control'), array(1,1), 'SUM');
	echo "Control applied, all entries were reduced by 1";

Well, that’s all! Now the trick is how to run the PHP cleanup script every 5 seconds or so. In my case I will run it every 30 seconds at first; by adding something like this in the crontab:

* * * * * php cleanup.php;
* * * * * sleep 30; php cleanup.php;

Why are these scripts good?
First, are very light. Second, are using sharding to speed up all checks. Third, are reusable.

What can be improved?
1-Make it compatible with different buckets. You might like to have one bucket for APIs, one bucket for login attempts, one bucket for premium customers who could have right to hit the APIs more often.
2-Convert it to OOP 😉 That’s the homework guys, if you convert it to OOP, drop me a line.
3-Apply it in a lower level, so the blocked client does not even hit the web server by being stopped at the network level.
4-Use more than one server -one master and one or more slaves- or many masters in a sharded setup based by set names. Remember Redis is a single threaded server, so you will definitely have advantage by running one instance per free core in your server. In that way your limit won’t be CPU but storage and RAM.

If this small project gets approved, I will apply these checks right after the framework and routing rules are loaded. In that way I will have access to the Redis configuration files, and to set the “action” names in a per-route basis. All categories, keywords, tags and search pages will be grouped under the “listing” action name. Login, register and password reset will be grouped in the “users” action. Embed and on-site player pages will be grouped under the “player” action. Voting, commenting, adding to favorites will be grouped under the “rating” action. This will, for sure, make our sites much more stable and will give better response times to normal users.

Leave a Reply

Your email address will not be published. Required fields are marked *