Caching Data for Scalability and Performance

Caching is Easy

In the PHP world when people talk about scalability, caching is inevitably mentioned as a means to increasing the performance of a site. Of course setting up and using something like memcache is pretty trivial. If you want to call something the hardest part with it, you do have to figure out how to install memcached as well as the php module for memcache. And using memcache is about this hard:

$memcache = new Memcache;
$memcache->connect('localhost', 11211);
 
$memcache->set('key', 'foo');
 
echo $memcache->get('key');

Which is awesome if you want to just want to store results from a news feed or want to implement a session handler other than using a database.

Caching Query Results is not Easy

Caching database results on the other hand, presents a number of difficulties:

  1. For one thing, many ORM implementations store either the database resource or a non-serializable result set or both in the actual ORM objects. Since caching requires that all objects be serialized, this puts a serious road block in our way.
  2. How do you track one set of data versus another in the cache?
  3. How do you notify the cache that a particular data set has changed?

Which is why, in general, I’ve found that when people talk about using memcache with data sets, you generally see examples like this:

<?php
$memcache = new Memcache;
$memcache->connect('127.0.0.1', 11211) or die ("Could not connect");
 
include('./db.php');
 
$key = md5("SELECT * FROM memc where FirstName='Memory'");
$get_result = array();
$get_result = $memcache->get($key);
 
if ($get_result) {
        echo "<pre>n";
        echo "FirstName: " . $get_result['FirstName'] . "n";
        echo "LastName: " . $get_result['LastName'] . "n";
        echo "Age: " . $get_result['Age'] . "n";
        echo "Retrieved From Cachen";
        echo "&lt;/pre&gt;";
 
} else {
 
        // Run the query and get the data from the database then cache it
        $query="SELECT * FROM memc where FirstName='Memory';";
        $result = mysql_query($query);
 
        $row = mysql_fetch_array($result);
        echo "<pre>n";
        echo "FirstName: " . $row[1] . "n";
        echo "LastName: " . $row[2] . "n";
        echo "Age: " . $row[3] . "n";
        echo "Retrieved from the Databasen";
        echo "&lt;/pre&gt;";
 
        $memcache->set($key, $row, MEMCACHE_COMPRESSED, 20); // Store the result of the query for 20 seconds
 
        mysql_free_result($result);
}

The query is setup, the cached checked and if its empty for that query then the query is run on the database and the result stored. Of course, almost all examples show this being done in a context that would typically mean putting this process in your domain layer everywhere that a query would be run. Not only is this cumbersome, but most examples ignore everything aside from #1 in the list above.

Actually, Caching Query Results isn’t Hard

Where I currently work our director has had a habit of mentioning the need for memcache and how we’d eventually have to ditch using an ORM altogether in order to implement memcache, and something about that sentiment never has sat quite right with me. I mean, it seems like if you can implement an external check on queries and cache them, then couldn’t an ORM library implement an internal caching mechanism to reduce database load? Since I was in the middle of developing the initial beta version of Gacela at the time, I decided to step and see how hard it would be to implement caching into its foundation. And the truth of the matter was it wasn’t hard. I just had to rethink the general architecture of the library so that caching could be supported from the ground up.

Dealing with #1 wasn’t that hard since Gacela already contained DataSources that were separate from the remainder of the objects. There were a couple of other objects that were also using the database handle, and removing their dependence on it and making the DataSources the only place that uses the database was handle was used is something I see as an additional plus. In my opinion one of the places where ORM libraries that use the database schema to discover information fail, is in the fact that they frequently have to pull the same schema information several times during a single request. So a planned feature of Gacela was storing all loaded mappers and resources in a repository. Implementing cache support means that mappers and resources (which generally only change when there is an update to the application) never HAVE to be loaded more than once.

Dealing with #2 seems straightforward if you are going to cache each record from each resource individually, but what about when you want to pull back a collection of records for a parent resource? Do you perform an initial query to pull back the id’s of the related records and then just hit against the database for individual records that aren’t already in the cache?

I ultimately decided to cache data based first on the resource associated to the Mapper being queried and second on the actual query data itself. So the DataSource::query() came to look something like this:

public function query(GacelaDataSourceResource $resource, $query, $args = null)
{
	if($query instanceof Query)  {
		// Using the _lastQuery variable so that we can see the query when debugging
		list($this->_lastQuery['query'], $this->_lastQuery['args']) = $query->assemble();
	} else {
		$this->_lastQuery = array('query' => $query, 'args' => $args);
	}
 
	$key = hash('whirlpool', serialize(array($this->_lastQuery['query'], $this->_lastQuery['args'])));
 
	$cached = $this->_cache($resource->getName(), $key);
 
	// If the query is cached, return the cached data
	if($cached !== false) {
		return $cached;
	}
 
	$stmt = $this->_conn->prepare($this->_lastQuery['query']);
 
	if($stmt->execute($this->_lastQuery['args']) === true) {
		$return = $stmt->fetchAll(PDO::FETCH_OBJ);
		$this->_cache($resource->getName(), $key, $return);
		return $return;
	} .....

Once we were caching unique sets of data based on specific queries for specific resources, fixing #3 became easy after I stumbled upon this suggestion from the guys who develop memcached. Since every DataSource supports exactly three methods for changing the state of data, it wasn’t hard to add the following into the insert(), update(), delete() methods:

if($query->execute($binds)) {
 
	// Increment cache version for the resource being updated, inserted, or deleted	
	$this->_incrementCache($name);
 
}

And then to create the _incrementCache() method:

protected function _incrementCache($name)
{
	// Get the Gacela instance which holds the Memcache instance
	$instance = $this->_singleton();
 
	// Bypass if memcache is not enabled
	if(!$instance->cacheEnabled()) {
		return;
	}
 
	$cached = $instance->cache($name.'_version');
 
	// If there isn't a cache version already, then no need to increment it
	if($cached === false) {
		return;
	}
 
	// Increment the cache version
	$instance->incrementCache($name.'_version');
}

When all is said and done, using a caching solution is very important for scaling in a LAMPP environment and given a little time and effort, caching database results doesn’t have to be rocket science and it can be rolled into an ORM library so that you’re code isn’t littered with manual checks against the cache.

One thought on “Caching Data for Scalability and Performance

  1. Hi, Memcache is great, but I don’t suggest it unless you are using it with something else (like NGNIX or MySql). The PHP APC cache is MUCH faster and does exactly the same job. I’ve profiled this with XDEBUG and it was worth switching. APC also caches you code too, so you get another speed boost.

    Remember that Memcache uses sockets to communicate, which always are going to cost more time.

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>