ConsumerSearch.com gets about 5.5M unique visitors each month (and growing). I don't know what server infrastructure they run on, but with help from Jeremy at Tag1 Consulting, they configured Drupal to rely heavily on memcached and Drupal's built-in aggressive caching mode. Knowing Jeremy, they are probably trying to serve cached pages from disk, rather than from the database.
It is not always easy to scale Drupal -- not because Drupal sucks, but simply because scaling the LAMP stack (Drupal included) takes no small amount of skill. You need to buy the right hardware, install load balancers, set up MySQL servers in master-slave mode, set up static file servers, set up web servers, get PHP working with an opcode cache, tie in a distributed memory object caching system like memcached, integrate with a content delivery network, watch security advisories for every component in your system, and configure and tune the hell out of everything.
Either you can do all of the above yourself, or you outsource it to a company that knows how to do this for you. Both are non-trivial and I can count the number of truly qualified companies on one hand. Tag1 Consulting is one of the few Drupal companies that excel at this, in case you're wondering.
My experience is that MySQL takes the most skill and effort to scale. While proxy-based solutions like MySQL Proxy look promising, I don't see strong signals that scaling MySQL is about to become fundamentally easier for mere mortals.
It is not unlikely that, in the future, scaling a Drupal site will be done using a radically different model. Amazon EC2, Google App Engine and even Sun Caroline are examples of the hosting revolution that is ahead of us. What is interesting is how these systems already seem to be evolving: Amazon EC2 allows you to launch any number of servers, but you are pretty much on your own to take advantage of them. For example, you still have to pick the operating system and install and configure MySQL, Apache, PHP and Drupal. Not to mention the fact that you don't have access to a good persistent storage mechanism. No, Amazon S3 doesn't qualify, and yes, they are working to fix this by adding Elastic IP addresses and Availability Zones. Either way, Amazon doesn't make it easier to scale Drupal. Frankly, all it does is make capacity planning a bit easier ...
Then along come Amazon SimpleDB, Google App Engine and Sun Caroline. Just like Amazon EC2/S3, they provide instant scalability, only they move things a level up the stack. They provide a managed application environment on top of a managed hosting environment. Google App Engine provides APIs that allow you to do user management, e-mail communication, persistent storage, etc. You no longer have to worry about server management or all of the scale-out configuration. Sun Caroline seems to be positioned somewhere in the middle -- they provide APIs to provision lower-level concepts such as processes, disk, network, etc.
Unfortunately for Drupal, Google App Engine is Python-only, but more importantly, a lot of the concepts and APIs don't map onto Drupal. Also, the more I dabble with tools like Hadoop (MapReduce) and CouchDB, the more excited I get, but the more it feels like everything that we do to scale the LAMP stack is suddenly wrong. I'm trying hard to think beyond the relational database model, but I can't figure out how to map Drupal onto this completely different paradigm.
So while the center of gravity may be shifting, I've decided to keep an eye on Amazon's EC2/S3 and Sun's Caroline as they are "relational database friendly". Tools like Elastra are showing a lot of promise. Elastra claims to be the world's first infinitely scalable solution for running standard relational databases in an on-demand computing cloud. If they deliver what they promise, we can instantly scale Drupal without having to embrace a different computing model and without having to do all of the heavy lifting. Specifically exciting is the fact that Elastra teamed up with EnterpriseDB to make their version of PostgreSQL virtually expand across multiple Amazon EC2 nodes. I've already reached out to Elastra, EnterpriseDB and Sun to keep tabs on what is happening.
Hopefully, companies like Elastra, EnterpriseDB, Amazon and Sun will move fast because I can't wait to see relational databases live in the cloud ...
As explained in an earlier blog post, we recently started using MySQL master-slave replication on drupal.org in order to provide the scalability necessary to accommodate our growing demands. With one or more replicas of our database, we can instruct Drupal to distribute or load balance the SQL workload among different database servers.
MySQL's master-slave replication is an asynchronous replication model. Typically, all the mutator queries (like INSERT, UPDATE, DELETE) go to a single master, and the master propagates all updates to the slave servers without synchronization or communication. While the asynchronous nature has its advantages, it also means that the slaves might be (slightly) out of sync.
Consider the following pseudo-code:
$nid = node_save($data);
$node = node_load($nid);
node_save() executes a mutator query (an INSERT or UPDATE statement) and has to be executed on the master, so the master can propagate the changes to the slaves. Because node_load() uses a read-only query, it can go to the master or any of the available slaves. Because of the lack of synchronization between master and slaves, there is one obvious caveat: when we execute node_load(), the slaves might not have been updated yet. In other words, unless we force node_load() to query the master, we risk not being able to present visitors with the data they just saved. In other cases, we risk introducing data inconsistencies due to race conditions.
So what is the best way to fix this?
- Our current solution on drupal.org is to execute all queries on the master, except for those that we know can't introduce race conditions. In our running example, this means that we'd choose to execute all node_load()s on the master, even in the absence of a node_save(). This limits our scalability, so it is nothing but a temporary solution until we have a better one in place.
- One way to fix this is to switch to a synchronous replication model. In such a model, all database changes will be synchronized across all servers to ensure that all replicas are in a consistent state. MySQL provides a synchronous replication model through the NDB cluster storage engine. Stability issues aside, MySQL's cluster technology works best when you avoid JOINs and sub-queries. Because Drupal is highly relational, we might have to rewrite large parts of our code base to get the most out of it.
- Replication and load balancing can be left to some of the available proxy layers, most notably Continuent's Sequoia and MySQL Proxy. Drupal connects to the proxy as if it were the actual database, and the proxy talks to the underlying databases. The proxy parses all the queries and propagates mutator queries to all the underlying databases to make sure they remain in a consistent state. Reads are only distributed among the servers that are up to date. This solution is transparent to Drupal, and should work with older versions of Drupal. The only downside is that it is not trivial to set up and, based on my testing, it requires quite a bit of memory.
- We could use database partitioning, and assign data to different shards in one way or another. This would reduce the replica lag to zero but as we don't have that much data or database tables with millions of rows, I don't think partitioning will buy drupal.org much.
- Another solution is to rewrite large parts of Drupal so they are "replication lag"-aware. In its most naive form, the node_load() function in our running example would get a second parameter that specifies whether the query should be executed on the master or not. The call would then be changed to node_load($nid, TRUE) when preceded by a node_save(). My research suggests this is not commonly done, probably because such a solution still doesn't provide any guarantees across page requests.
- A notable exception is MediaWiki, the software behind Wikipedia, which has documented best practices for dealing with replication lag. Specifically, they recommend querying the master (using a very fast query) to see what version of the data has to be retrieved from the slave. If the specified version is not yet available on the slave due to replication lag, they simply wait for it to become available. In our running example, each node would get a version number: node_load() would first retrieve the latest version number from the master and then use that version number to make sure it gets an up-to-date copy from the slave. If the right version isn't yet available, node_load() will retry until it becomes available.
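The MediaWiki-style approach could be sketched in standalone PHP as follows. This is an illustration, not actual Drupal or MediaWiki code: node_load_consistent() and the per-node version numbers are hypothetical, and the master/slave stores are simulated with arrays so the example runs on its own.

```php
<?php
// Simulated master and (lagging) slave stores. In a real deployment these
// would be two database connections; all names here are hypothetical.
$master = array(5 => array('version' => 3, 'title' => 'Hello'));
$slave  = array(5 => array('version' => 2, 'title' => 'Old hello'));

function node_load_consistent($nid) {
  global $master, $slave;
  // 1. Cheap query against the master: which version must the slave have?
  $required = $master[$nid]['version'];
  // 2. Poll the slave until it has caught up (bounded, to avoid spinning).
  for ($attempt = 0; $attempt < 10; $attempt++) {
    if ($slave[$nid]['version'] >= $required) {
      return $slave[$nid];
    }
    // Simulate replication catching up; a real implementation would sleep
    // briefly and re-query the slave here.
    $slave[$nid] = $master[$nid];
  }
  // Fall back to the master if the slave never catches up.
  return $master[$nid];
}

$node = node_load_consistent(5);
echo $node['title'], "\n"; // prints "Hello", never the stale copy
```

The version check against the master is cheap (a single-row primary-key lookup), which is what makes this pattern practical at Wikipedia's scale.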
Yahoo! released YSlow, a Firefox extension that integrates with the popular Firebug tool. YSlow was originally developed as an internal tool at Yahoo! with the help of Steve Souders, Chief Performance Yahoo! and author of O'Reilly's High Performance Web Sites book.
YSlow analyzes the front-end performance of your website and tells you why it might be slow. For each component of a page (images, scripts, stylesheets) it checks its size, whether it was gzipped, its Expires header, its ETag header, etc. YSlow takes all this information into account and computes a performance grade for the page you are analyzing.
The current <a href="http://developer.yahoo.com/yslow/">YSlow</a> score for the <a href="http://drupal.org">drupal.org front page</a> is 74 (C). YSlow suggests that we reduce the number of CSS background images using <a href="http://alistapart.com/articles/sprites">CSS sprites</a>, that we use a Content Delivery Network (CDN) like <a href="http://akamai.com">Akamai</a> for delivering static files, and identifies an Apache configuration issue that affects the <em>Entity Tags</em> or <em>ETags</em> of static files. The problem is that, by default, Apache constructs ETags using attributes that make them unique to a specific server. A stock Apache embeds <em>inode numbers</em> in the ETag which dramatically reduces the odds of the validity test succeeding on web sites with multiple servers; the ETags won't match when a browser gets the original component from server A and later tries to validate that component on server B.
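A common fix for the ETag issue, assuming a stock Apache 2.x, is to remove the server-specific inode component from the ETag so that all servers in the cluster generate identical values (or to drop ETags entirely and rely on Expires and Last-Modified headers):

```apache
# Build ETags from modification time and size only, so they match
# across servers (drops the server-specific inode component).
FileETag MTime Size

# Alternatively, disable ETags altogether:
# FileETag None
# Header unset ETag
```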
Here are some other YSlow scores (higher is better):
- http://wordpress.org: 78 (C)
- http://drupal.org: 74 (C)
- http://plone.org: 64 (D)
- http://postnuke.com: 63 (D)
- http://typo3.org: 56 (F)
- http://mamboserver.com: 56 (F)
- http://joomla.org: 53 (F)
From what I have seen, Apache configuration issues, not CMS implementation issues, are the main source of low YSlow scores. Be careful not to draw incorrect conclusions from these numbers; they are often not representative of the CMS software itself.
And it doesn't change the fact that drupal.org is currently a lot slower than most of these other sites. That is explained by drupal.org's poor back-end performance, and not by the front-end performance as measured by YSlow. (We're working on adding a second database server to drupal.org.)
To deal with Drupal's growth, we're adding a second database server to drupal.org which is useful for at least two reasons. First, we'll be able to handle more SQL queries as we can distribute them between multiple database servers (load balancing). Secondly, this new server can act as a "hot spare" that can immediately take over if the other database server fails (high availability / fail-over).
The current plan is to configure both database servers in a master-slave setup, which is the most common replication model for websites. This model provides scalability, but not necessarily fail-over. With a master-slave configuration, all data modification queries (like INSERT, UPDATE and DELETE queries) are sent to the master. The master writes updates to a binary log file, and serves this log file to the slaves. The slaves read the queries from the binary log, and execute them against their local copy of the data. While all data modification queries go to the master, all the read-only queries (most notably the SELECT queries) can be distributed among the slaves. By following this model, we can spread the workload amongst multiple database servers. And as drupal.org's traffic continues to grow, we can scale horizontally by adding more slaves.
While MySQL does the database replication work, it doesn't do the actual load balancing work. That is up to the application or the database abstraction layer to implement. To be able to distribute queries among multiple database servers, the application needs to distinguish between data modification queries and read-only queries.
Care needs to be taken, as the data on the slaves might be slightly out of sync. It is not always practical to guarantee low replication lag, and in those cases the application might want to require that certain read-only queries go to the master.
There are different ways to accomplish this:
- Drupal executes all SQL queries through db_query(). Traditionally, big Drupal sites manually patched db_query() to use query parsing (regular expression foo) to separate read queries from write queries. This is not convenient, and it doesn't provide a good way to deal with lag. Fortunately, work is being done to provide better support for database replication in Drupal 6. It is our intent to backport this to Drupal 5 so we can use it on drupal.org until Drupal 6 has been released and drupal.org has been upgraded to use Drupal 6.
- MediaWiki, the software behind Wikipedia, uses $db->select() and $db->insert(). They have documented best practices for dealing with lag.
- Neither the Pear DB database abstraction layer nor its successor, Pear MDB2, seems to support database replication.
- WordPress uses HyperDB, a drop-in replacement for WordPress' default database abstraction layer that adds support for replication. It was developed for use on WordPress.com, a mass hosting provider for WordPress blogs. Because HyperDB is a drop-in replacement, it doesn't have a clean API and, just like Drupal 5, it has to use query parsing to separate read queries from write queries. It's not clear how they deal with lag.
- Joomla! 1.0 does not separate read queries from write queries, but Joomla! 1.5 will use functions like $db->updateObject(). Joomla! 1.5 won't support replication out of the box, but its API allows a clean drop-in replacement to be developed. It's not clear how they would deal with lag.
- PostNuke uses the ADOdb database abstraction library, which at first glance does not support database replication either.
- Java applications use the statement.executeQuery() and statement.executeUpdate() methods that are part of the standard class libraries. It's not clear how they deal with lag.
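To give a feel for the query-parsing approach that big Drupal sites traditionally patched into db_query(), here is a minimal standalone sketch. The function name is hypothetical, and a production implementation would also need to handle SQL comments, parenthesized statements and other edge cases:

```php
<?php
// Hypothetical helper: decide whether a query may be sent to a slave.
// Anything that is not a plain SELECT is treated as a write and must go
// to the master; SELECT ... FOR UPDATE must also go to the master.
function query_is_read_only($sql) {
  $sql = ltrim($sql);
  return preg_match('/^SELECT\b/i', $sql)
      && !preg_match('/\bFOR\s+UPDATE\b/i', $sql);
}

var_dump(query_is_read_only('SELECT nid FROM node'));          // bool(true)
var_dump(query_is_read_only("UPDATE node SET title = 'x'"));   // bool(false)
var_dump(query_is_read_only('SELECT * FROM node FOR UPDATE')); // bool(false)
```

The fragility of exactly this kind of pattern matching is why an explicit API, where the caller states its intent, is the better long-term design.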
What other popular applications support database replication, and what do their APIs look like? I'd like to find out what the ideal API looks like so we can still push that for inclusion in Drupal 6.
Based on the research above, I think we should get the best of all worlds by introducing these three functions (and deprecating db_query()):
db_select_slave(); // results might be slightly out of date
db_select_master(); // results always up to date
db_update(); // for UPDATE, INSERT, DELETE, etc
Even if they don't actually do anything useful in Drupal 6, and just map onto the deprecated
db_query(), they set a better standard, and they allow for a drop-in replacement to be developed during Drupal 6's lifetime. Ideally, however, Drupal 6 would ship with a working implementation.
With the release of Drupal 5, you might be wondering which version of Drupal is faster -- the latest release in the Drupal 4 series, or the new Drupal 5?
I set up a Drupal 4.7 site with 2,000 users, 5,000 nodes, 5,000 path aliases, 10,000 comments and 250 vocabulary terms spread over 15 vocabularies.
Next, I configured the main page to show 10 nodes, enabled some blocks in both the left and the right sidebar, set up some primary links, and added a search function at the top of the page. I also set up a contact page using Drupal's contact module. The image below depicts how my final main page was configured.
Furthermore, I made an exact copy of the Drupal 4.7 site and upgraded it to the latest Drupal 5 release. The result is two identical websites; one using Drupal 4.7 and one using Drupal 5.
Benchmarks were conducted on a 3 year old Pentium IV 3Ghz with 2 GB of RAM running Gentoo Linux. I used a single tier web architecture with the following software: Apache 2.0.58, PHP 5.1.6 with APC, and MySQL 5.0.26. No special configuration or tweaking was done other than what was strictly necessary to get things up and running. My setup was CPU-bound, not I/O-bound or memory-bound.
Apache's ab2 with 20 concurrent clients was used to compute how many requests per second the above setup was capable of serving.
Drupal page caching
Drupal has a page cache mechanism that stores dynamically generated web pages in the database. By caching a web page, Drupal does not have to create the page each time it is requested. Only pages requested by anonymous visitors (users that have not logged on) are cached. Once users have logged on, caching is disabled for them since the pages are personalized in various ways. On some websites, like this weblog, everyone but me is an anonymous visitor, while on other websites there might be a good mix of both anonymous and authenticated visitors.
When presenting the benchmark results, I'll make a distinction between cached pages and non-cached pages. This will allow you to interpret the results with the dynamics of your own Drupal websites in mind.
Furthermore, a Drupal 5 installation has two caching modes: normal database caching and aggressive database caching. The normal database cache is suitable for all websites and does not cause any side effects. The aggressive database cache causes Drupal to skip the loading (init-hook) and unloading (exit-hook) of enabled modules when serving a cached page. This results in an additional performance boost but can cause unwanted side effects if you skip loading modules that shouldn't be skipped.
Through contributed modules, Drupal also supports file caching which should outperform aggressive database caching. I have not looked at file caching or any other caching strategies that are made available through contributed modules.
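For reference, the cache mode can also be forced from settings.php by overriding the cache variable; the numeric values below correspond to the CACHE_* constants in Drupal 5's bootstrap.inc (normally you would simply pick the mode on the admin performance settings page):

```php
// settings.php override: force the aggressive database cache.
// 0 = CACHE_DISABLED, 1 = CACHE_NORMAL, 2 = CACHE_AGGRESSIVE
$conf['cache'] = 2;
```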
The number of pages Drupal can serve per second. Higher bars are better.
The figure above shows that generating a page in Drupal 5 is 3% slower than in Drupal 4.7. However, when serving a cached page using the normal database cache, Drupal 5 is 73% faster than Drupal 4.7, and 268% faster when the aggressive database cache is used.
What does this mean when looking at the overall performance of a Drupal 5 website? Well, the effectiveness of Drupal's page cache depends on a number of parameters like your cache expiration time, the number of authenticated users, access patterns, etc. To emulate different Drupal configurations, we modified Drupal 4.7 and Drupal 5 so we could look at performance for a range of page cache miss rates.
The relative performance improvement of Drupal 5's normal database caching compared to Drupal 4.7's database caching. A miss rate of 0% means that all page requests result in a cache hit and that all pages can be served from the database cache. A miss rate of 100% means that all page requests result in a cache miss, and that we had to dynamically generate all pages.
The figure above shows the relative performance improvement of Drupal 5 compared to Drupal 4.7. We observe that Drupal sites with relatively few cache misses (typically static Drupal websites accessed by anonymous users) will be significantly faster with Drupal 5. However, Drupal sites where more than 1 out of 2 page requests results in a cache miss (typically dynamic Drupal websites with a lot of authenticated users) will be slightly slower compared to an identical Drupal 4.7 website.
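The break-even point can be approximated with simple arithmetic: model the average cost per request as a weighted sum of cached and non-cached page cost. The ratio $r below (how much more expensive generating a page is than serving it from the cache) is an assumed, illustrative value, not a measurement from these benchmarks:

```php
<?php
// Average cost per request, in units of one Drupal 4.7 cached page.
// $m is the cache miss rate; $r is the assumed cost of generating a page
// relative to serving it from the cache.
function avg_cost($m, $r, $cached_speedup, $uncached_slowdown) {
  return $m * $r * $uncached_slowdown + (1 - $m) / $cached_speedup;
}

$r = 14; // assumption for illustration only
foreach (array(0.0, 0.25, 0.5, 0.75, 1.0) as $m) {
  $d47 = avg_cost($m, $r, 1.00, 1.00); // Drupal 4.7 baseline
  $d5  = avg_cost($m, $r, 1.73, 1.03); // Drupal 5: cached 73% faster, non-cached 3% slower
  printf("miss rate %.0f%%: Drupal 5 is %+.1f%% faster\n",
         100 * $m, 100 * ($d47 / $d5 - 1));
}
```

At a 0% miss rate the model reproduces the measured +73%, at a 100% miss rate the measured -3%, and with the assumed r the crossover sits near a 50% miss rate, matching the shape of the graph above.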
To me these graphs suggest that for most Drupal websites, upgrading to Drupal 5 will yield at least a small performance improvement -- especially if you properly configure your page cache's expiration time. Furthermore, they suggest that for Drupal 6, we need to look at improving the page generation time of non-cached pages. Let's make that an action item.
I used the "Apache, mod_php, PHP4, APC" configuration from previous benchmark experiments to compare the performance of Drupal and Joomla on a 3 year old Pentium IV 3Ghz with 2 GB of RAM running Gentoo Linux. I used the following software: Apache 2.0.55, PHP 4.4.2, MySQL 4.1.4, Drupal 4.7.3 and Joomla 1.0.10.
I simply downloaded and installed the latest stable release of both Drupal and Joomla, and tried my best to make them act and look the same. To do so, I enabled the login form and the "Who's online" block. I also set up two links and a search widget in the top menu, enabled the hit counters for posts, and set up identical footers. Next, I created one author, one category and one post as shown in the images below.
Apache's ab2 was used to compute how many requests per second both systems are capable of serving. The page was requested 1000 times with a concurrency of 5 (i.e. ab2 -n 1000 -c 5). To test the impact of gzip-compressing pages, we specified whether ab2 can accept gzip-compressed pages (i.e. ab2 -n 1000 -c 5 -H "Accept-Encoding: gzip;"). Note that ab2 did not request any images or CSS files; only the dynamically generated HTML document was retrieved.
Requests per second
When caching is disabled Joomla can serve 19 pages per second, while Drupal can serve 13 pages per second. Hence, Joomla is 44% faster than Drupal.
However, when caching is enabled Joomla can serve 21 pages per second, while Drupal can serve 67 pages per second. Here, Drupal is 319% faster than Joomla.
In other words, Joomla's cache system improves performance by 12%, while Drupal's cache system improves performance by 508%.
It is important to note that Drupal can only serve cached pages to anonymous visitors (users that have not logged on). Once users have logged on, caching is disabled for them since the pages are personalized in various ways. Hence, in practice, Drupal might not be 319% faster than Joomla; it depends on the ratio of anonymous visitors versus authenticated visitors, how often your site's page cache is flushed, and the hit-rate of your Drupal page cache.
Lastly, when serving gzip-compressed pages Drupal becomes slightly faster compared to having to serve non-compressed pages. Joomla, on the other hand, becomes a little bit slower. The reason is that Drupal's page cache stores its content directly in a compressed state; it has to uncompress the page when the client does not support gzip-compression, but can serve a page directly from the page cache when the client does support gzip-compression.
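Drupal's trick can be sketched in a few lines of standalone PHP. The function names below are illustrative, not Drupal's actual cache API; the point is that compression happens once, at cache time, so the common case (a gzip-capable client) needs no work at all:

```php
<?php
// Store the page gzip-compressed, roughly as Drupal's page cache does.
// Function names are illustrative, not real Drupal APIs.
function cache_page($html) {
  return gzencode($html, 9);
}

function serve_page($cached, $client_accepts_gzip) {
  if ($client_accepts_gzip) {
    // Fast path: send the cached, already-compressed bytes as-is.
    return $cached;
  }
  // Slow path: uncompress for the minority of clients without gzip support.
  return gzdecode($cached);
}

$cached = cache_page('<html><body>Hello</body></html>');
echo serve_page($cached, FALSE), "\n"; // prints the original HTML
```

Joomla does the opposite: it caches uncompressed pages and compresses on the way out, so gzip support costs it a little extra CPU on every request.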
The first figure shows that the cost of compressing or uncompressing pages is negligible. The second figure shows that it can, however, have a significant impact on the document length, and hence, on bandwidth usage.
Drupal always attempts to send compressed pages. Joomla, on the other hand, doesn't compress pages unless this option is explicitly turned on.