Leaky bucket algorithm implemented on Redis and PHP

One of the biggest problems in web development is that programmers are often not fully aware of the infrastructure they work with. The average PHP developer does not know the limitations of that infrastructure, the maximum number of connections the web site can handle, and so on. The PHP developer just writes PHP, refreshes the page, runs the unit tests, and if everything is OK, considers the work finished. However, once the application is deployed, everything starts to depend on the actual infrastructure.

In many cases the pages will not be ready to handle an excessive amount of traffic, and they will noticeably degrade service quality under heavy -and not so heavy- loads. Think of APIs heavily used by external developers -who may not be aware of what caching means-, or a site attacked by scrapers who want to populate their own site with your data, or a botnet.

This is a recurring problem at my office, so I proposed to start using the well-known leaky bucket algorithm to protect our sites from scrapers. My manager liked the initiative, but he wanted to apply it at the infrastructure level and not at the application level. While I strongly agree with that as the definitive solution, I think there is no point in implementing the whole thing at the infrastructure level without knowing how bad the current situation is. What would we say? “Block all IPs making more than X requests per minute?” Yes, that’s the idea, but what would that X be?

What I wanted to do instead is to apply the leaky bucket at the application level for testing purposes. That will not take too much time, and by logging how many requests and how many different clients would have been blocked, we can gather some interesting information for the definitive implementation at the infrastructure level. It would also allow us not only to log who would be blocked, but to add some Analytics event tracking codes. That way we would see in Analytics how many real users would have been blocked with the chosen settings, allowing us to tune them. Besides the server-side logging, we also want to know what percentage of the blocked clients are real browsers and not scrapers.

That is how I came up with these small PHP files that implement the whole thing for testing purposes.

The code is split in two parts: the request script and the clean-up script. The request script is basically the doorman: “yes, come in” or “no, get the f*ck out of here”. The clean-up script is the one that reviews the list every so often and takes some drops out of the bucket. The whole thing uses the Flexihash library for consistent hashing, splitting the data across as many sets as you need. The example is fully dynamic, but you can hardcode the set names to make it faster.

That said, please get the Flexihash files from http://github.com/pda/flexihash and point the include line to the proper place.

Then, let’s start with the initial inc.php file which should be included in both request.php and cleanup.php:

<?php
include 'flexihash-0.1.9.php';

// Redis extension from nicolasff, thanks bro!
$redis = new Redis();
$redis->connect('localhost');

$hasher = new Flexihash();

$numberOfSets = 10;

$sets = array();
for($i = 1; $i <= $numberOfSets; $i++){
	$pad = str_pad($i, strlen($numberOfSets), '0', STR_PAD_LEFT);
	$sets[] = 'set'.$pad; // To create sets like set01, set02;
	// or set0001 ... set1000 if $numberOfSets is 1000
}

$hasher->addTargets($sets);

This part is easy, right? We prepared the Redis connection and the hasher.

Now, let’s see request.php

<?php
require('inc.php');

if(sizeof($argv) < 3){
	die("2 parameters needed: clientId, actionId; e.g. php request.php A search\n");
}

$id = $argv[1];
$action = $argv[2];

$id = $id.'-'.$action;

$set = $hasher->lookup($id);

$period = 30; // In seconds
$limit = 6; // How many hits allowed every $period seconds

$actualHits = $redis->zscore($set, $id);

if($actualHits >= $limit){
	echo "Not allowed.  {$actualHits} hits done, only {$limit} are allowed every {$period} seconds\n";

	// Log that this request would have been locked.
	$redis->zIncrBy('locked', 1, $id);
	die();
}


// Record the hit: increment the client's score in $set and make sure the client
// is present in the control zset with score -1 (used later by cleanup.php).
list($actualHits) = $redis->multi()->zIncrBy($set, 1, $id)->zAdd($set.':control', -1, $id)->exec();
$available = $limit - $actualHits;
echo "Approved, you have {$available} hits\n";

In short, the script checks whether the requesting client (in most cases this will be an IP check) still has enough shots to perform the requested action. If it does not, we log the request for statistical purposes and fine tuning. If it does, we increase its number of requests by 1, we add it to the control set with score -1 -to be used later on- and we let the client know how many hits are remaining.
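
Since every blocked request increments the 'locked' sorted set, you can later pull the most frequently blocked client/action pairs out of it for the statistics and fine tuning mentioned above. A minimal sketch of such a report script (not part of the original code):

<?php
require('inc.php');

// Top 20 blocked client/action pairs and how many times each one was blocked.
// Passing true as the fourth argument makes zRevRange return member => score pairs.
$topBlocked = $redis->zRevRange('locked', 0, 19, true);

foreach($topBlocked as $clientAction => $blockedHits){
	echo "{$clientAction} would have been blocked {$blockedHits} times\n";
}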

Now, let’s see the cleanup script, which should be executed periodically as a cronjob. We’ll come back to that later, no worries.

<?php
require('inc.php');

// Loop across all sets
foreach($sets as $set){
	// Remove all entries with score <= 1 from all sets. This will
	// reduce the size before further processing.
	echo "Set has ".$redis->zCard($set)." elements, before cleanup.";
	$redis->zRemRangeByScore($set, '-inf', '1');
	echo " Now it has ".$redis->zCard($set).".\n";
	
	// Remember the control set we created on request.php? That sorted set contains all entries
	// the set has, but with score as -1. The goal of that zset is to reduce by 1 all scores
	// storing user hits, by intersecting set and set:control.
	echo "Control set had ".$redis->zcard($set.':control') . ' before cleanup, now it has ';
	$redis->zinterstore($set.':control', array($set.':control', $set), array(1,0), 'SUM');
	echo $redis->zcard($set.':control') ."\n";
		
	// Now do the interstore on $set itself, subtracting one from every score.
	// Remember in request.php we added the client to the control set with score -1?
	// That's used here with zInterStore to subtract 1 from all
	// scores. The trick is the SUM aggregation combined with the weights.
	$redis->zinterstore($set, array($set, $set.':control'), array(1,1), 'SUM');
	echo "Control applied, all entries were reduced by 1\n";
}

Well, that’s all! Now the trick is how to run the PHP cleanup script every 5 seconds or so. In my case I will run it every 30 seconds at first, by adding something like this to the crontab:

* * * * * php cleanup.php;
* * * * * sleep 30; php cleanup.php;

Why are these scripts good?
First, they are very light. Second, they use sharding to speed up the checks. Third, they are reusable.

What can be improved?
1-Make it compatible with different buckets. You might want one bucket for APIs, one bucket for login attempts, and one bucket for premium customers who are allowed to hit the APIs more often (see the sketch after this list).
2-Convert it to OOP 😉 That’s the homework, guys; if you convert it to OOP, drop me a line.
3-Apply it at a lower level, so a blocked client does not even hit the web server, being stopped at the network level instead.
4-Use more than one server -one master and one or more slaves- or many masters in a sharded setup based on set names. Remember Redis is a single-threaded server, so you will definitely gain by running one instance per free core in your server. That way your limit won’t be CPU but storage and RAM.
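
As a rough idea for the first point, the period and limit could be looked up per bucket instead of being hardcoded in request.php. Here is a minimal sketch; the bucket names and numbers are made up for illustration:

<?php
// Hypothetical per-bucket settings; these names and limits are only examples.
$buckets = array(
	'api'     => array('period' => 30, 'limit' => 6),
	'login'   => array('period' => 60, 'limit' => 5),
	'premium' => array('period' => 30, 'limit' => 60),
);

// In request.php the bucket could come in as a third argument,
// e.g. "php request.php A search api".
$bucket = isset($argv[3]) ? $argv[3] : 'api';

$period = $buckets[$bucket]['period']; // would replace the hardcoded $period
$limit  = $buckets[$bucket]['limit'];  // would replace the hardcoded $limit

// Prefixing the client/action id with the bucket name would keep the counters
// of different buckets separated inside the sorted sets.
echo "Using bucket {$bucket}: {$limit} hits every {$period} seconds\n";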

If this small project gets approved, I will apply these checks right after the framework and routing rules are loaded. That way I will have access to the Redis configuration files and be able to set the “action” names on a per-route basis. All category, keyword, tag and search pages will be grouped under the “listing” action. Login, register and password reset will be grouped under the “users” action. Embed and on-site player pages will be grouped under the “player” action. Voting, commenting and adding to favorites will be grouped under the “rating” action. This will, for sure, make our sites much more stable and will give better response times to normal users.
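
To give an idea of what that per-route grouping could look like, here is a minimal sketch; the route patterns and the routeToAction() helper are hypothetical, not taken from any real framework:

<?php
// Hypothetical mapping from route patterns to the action names described above.
$routeActions = array(
	'#^/(category|keyword|tag|search)/#'  => 'listing',
	'#^/(login|register|password-reset)#' => 'users',
	'#^/(embed|player)/#'                 => 'player',
	'#^/(vote|comment|favorite)/#'        => 'rating',
);

function routeToAction($uri, array $routeActions) {
	foreach($routeActions as $pattern => $action){
		if(preg_match($pattern, $uri)){
			return $action;
		}
	}
	return 'default';
}

// The resulting action name would be the second parameter of the leaky bucket
// check, together with the client IP as the first one.
echo routeToAction('/search/funny-cats', $routeActions), "\n"; // prints "listing"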

Coding with style. Excessive style, actually.

We had a bug where a DIV was being hidden by an asynchronously loaded JavaScript bar at the top of the page. So one of my front-end developers came up with this solution: push the DIV a bit lower so it shows completely. That is coding with a lot of style, isn’t it?

A double style tag? Are you kidding me, bro?

Weirdest reason in a technical request email ever: Porn is not allowed in Moldova

Some time ago I was working on a porn search engine project -now with more than 10 thousand visits a day and growing- and I found myself writing the weirdest reason ever to ask for a domain change. The site was www.porn.md, supposed to be “the porn doctor”. You would write your problem in the search box, and the doctor would give you some remedies. Everything was going well, traffic was going up, until we received a notification from the managers of the .md TLD: Moldova does not allow porn websites.

We needed to move the project to a different domain: pornmd.com. There was no choice.

So I went to the email client and sent an email to my hosting company with the weirdest reason ever.

I will need to change the domain name from porn.md to pornmd.com because porn is not allowed in Moldova.

Thanks!

Nicolás Andrade

 

“Porn is not allowed in Moldova”. Have you ever written a weirder reason than that?

New ‘tax’: Internet Explorer 7 customers will be charged extra

Kogan.com, an Australian retailer, has decided to “tax” customers using Internet Explorer 7 or lower.

Ruslan Kogan, chief executive officer of Kogan.com, explained that they need to find a way to cover the extra cost of making their pages display properly in a very old browser.

Every month the tax will increase by 0.1%, because IE7 is one month older.

More information at http://www.bbc.com/news/technology-18440979

“Anyone who is involved with the internet and web technology would know the amount of time that is wasted to support all these antiquated browsers,” Kogan said. “You have to make all these work-arounds all the time to make sure the site works properly on it.”

 

Mod_Redis for Apache

Thanks to the Redis Google Group I stumbled upon this link while browsing the web: https://github.com/sneakybeaky/mod_redis

It’s an Apache module providing an interface to Redis servers. This means you can access different URLs to set or get values from Redis, avoiding all the overhead that PHP, the PHP Redis extension and probably your framework add.

By using different HTTP methods (POST, PUT, DELETE, GET), you can set and retrieve keys in a very customizable way.

By default the module returns an XML representation of the resulting Redis response. For example configuring a RedisAlias to support the PING command:

RedisAlias ^ping$ "PING"

the response would be:

<?xml version="1.0" encoding="UTF-8"?> <response> <status>PONG</status></response> 

This behavior can be altered by appending .json to the URL; the response then becomes, for example:

{ "status" : "PONG" } 

Alternatively, appending .jsonp and adding a ‘callback’ query parameter produces:

callback({ "status" : "PONG" });

The Mod_Redis module can be useful for loading different values in asynchronous requests after the page has loaded. For example, you can dynamically refresh the view counts on the player page every 30 seconds with an Ajax call that never touches PHP or your application code, or refresh the number of comments a video has.

There’s also an Nginx module providing the same features: https://github.com/agentzh/redis2-nginx-module

SEO Tips: Zero Page Results

Search Engine Optimization is one of the musts for a web site. There are many well-known ways to keep SEO good: only one H1 per document, good keywording, and so on.

For websites offering a search feature, it’s good to show some links to common or important searches. The typical “tag cloud” is a good example; it provides links to pages with search results.

But what happens when we don’t have any results on such pages? Let’s say we have a tag “expired”, which groups all the documents, products or posts that we consider expired but still want to keep available to users, so the content stays online. In this case the search for “expired” will return documents. Then one fine day we decide not to show those results anymore, so the page /search/expired, which was already indexed by all search engines, will now simply show “Sorry, but there are no results for your query”.

To solve that we have an excellent tool at our disposal. Instead of returning the “200 OK” header, we’ll return “404 Not Found”, but we’ll still show a page saying “There are no results, but here are a lot of suggestions”. We offer the visitor different options, but our page won’t stay indexed in Google.

That’s it: when no results are found for an indexed search URL -say, when Google tries to index a ZRP, a Zero Result Page- you should reply with a list of suggestions and a nice “404 Not Found” header:

header("HTTP/1.0 404 Not Found");

This will ensure that no search engine will index that page.
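
As a rough sketch of how that could look in a search controller (the $results variable and the suggested searches are placeholders, not code from any specific framework):

<?php
// $results would come from your search backend; an empty array means a ZRP.
$results = array(); // placeholder: pretend the query returned nothing

if(count($results) === 0){
	// Tell search engines this page should not be (or stay) indexed...
	header("HTTP/1.0 404 Not Found");

	// ...but still give human visitors something useful to click on.
	echo "Sorry, there are no results for your query. You might like these searches instead:\n";
	foreach(array('popular', 'new', 'trending') as $suggestion){
		echo "- /search/{$suggestion}\n";
	}
	exit;
}

// Otherwise render the search results as usual.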