You are here

LAMP stack Halloween cake

Barry Jaspan and his wife Heather spent 20 hours creating this incredible cake for Acquia's Halloween party. Creative duo! Not only did it look great, it was yummy. Trick or treat!

Halloween
Halloween

The history of MySQL AB

MySQL, the open source database product that puts the "M" in LAMP, was created by MySQL AB, a company founded in 1995 in Sweden. In 2008, MySQL AB announced that it had agreed to be acquired by Sun Microsystems for $1 billion.

The story of MySQL AB is pretty amazing, so I unleashed my "inner academic", did some research and compiled a timeline of MySQL AB's history. This timeline is assembled based on different resources online, such as MySQL press releases (example 1) and interviews with MySQL AB executives (example 2, example 3), etc.

Things to add? Let me know in the comments and I'll update the post.

1995

  • The MySQL project was started by Michael Widenius (Monty), David Axmark and Allan Larsson. For the first 6 years it was commercially represented by TCX, a company owned by one of the 3 founders. Monty was in Finland, David and Allan in Sweden.

2000

  • MySQL goes Open Source and releases software under the terms of the GPL. Revenues dropped 80% as a result, and it took a year to make up for it.

2001

  • Mårten Mickos elected CEO at age 38. Mårten was the CEO of a number of Nordic companies before joining MySQL, and comes with a sales and marketing background.
  • 2 million active installations.
  • Raised a €4 million series A from Scandinavian venture capitalists.
  • MySQL AB and its founders were sued by Progress / NuSphere. MySQL AB countersued as Progress had violated the GPL license.

2002

  • MySQL launched US headquarters in addition to Swedish headquarters.
  • 3 million active users.
  • Ended the year with $6.5 million in revenue with 1,000 paying customers.

2003

  • Raised a $19.5 million series B from Benchmark Capital and Index Ventures. Their Series B pitch deck is available online.
  • 4 million active installations and over 30,000 downloads per day.
  • Ended the year with $12 million in revenue.
  • MySQL AB enters into a partnership with SAP (which will end in 2007). Several features in MySQL 5 were produced with SAP in mind.

2004

  • With the main revenue coming from the OEM dual-licensing model, MySQL decides to move more into the enterprise market and to focus more on recurring revenue from end users rather than one-time licensing fees from their OEM partners.
  • Ended the year with $20 million in revenue.

2005

  • MySQL launched the MySQL Network modeled after the RedHat Network. The MySQL Network is a subscription service targeted at end users that provides updates, alerts, notifications, and product-level support designed to make it easier for companies to manage hundreds of MySQL servers.
  • MySQL 5 ships and includes many new features to go after enterprise users (e.g. stored procedures, triggers, views, cursors, distributed transactions, federated storage engines, etc.)
  • Oracle buys Innobase, the 4-person Finnish company behind MySQL's InnoDB storage backend.
  • Ended the year with $34 million in revenue based on 3400 customers.

2006

  • Mårten Mickos confirms that Oracle tried to buy MySQL. Oracle' CEO Larry Ellison commented: "We've spoken to them, in fact we've spoken to almost everyone. Are we interested? It's a tiny company. I think the revenues from MySQL are between $30 million and $40 million. Oracle's revenue next year is $15 billion."
  • Oracle buys Sleepycat, the company that provides MySQL with the Berkeley DB transactional storage engine.
    For the most part, the Sleepycat acquisition was a non-event for MySQL AB and users of MySQL.
  • Mårten Mickos announces that they are making MySQL ready for an IPO in 2008 on a projected $100 million in revenues.
  • 8 million active installations.
  • MySQL has 320 employees in 25 countries, 70 percent of whom work from home.
  • Raised a $18 million Series C based on a rumored valuation north of $300 million.
  • MySQL is estimated to have a 33% market share measured in install base and 0.2% market share measured in revenue (the database market was a $15 billion market in 2006).
  • Ended the year with $50 million in revenue.

2007

  • Ended the year with $75 million in revenue.

2008

  • Sun Microsystems acquired MySQL AB for approximately $1 billion.
  • Michael Widenius (Monty) and David Axmark, two of MySQL AB's co-founders, leave Sun shortly after the acquisition.

2009

  • Mårten Mickos leaves Sun and becomes entrepreneur-in-residence at Benchmark Capital. Sun has now lost the business and spiritual leaders that turned MySQL into a success.
  • Sun Microsystems and Oracle announced that they have entered into a definitive agreement under which Oracle will acquire Sun common stock for $9.50 per share in cash. The transaction is valued at approximately $7.4 billion.

February 2016 update: After 6 years, I made several small adjustments based on feedback from Mårten Mickos! It's also worth noting that MySQL is still going strong 8 years after the acquisition by Sun. MySQL is still used by Facebook, Google, Yahoo, Wikipedia and virtually every Drupal and WordPress site. Engineering is led by Tomas Ulin who joined MySQL AB back in 2003. The whole business is led by Richard Mason who joined MySQL AB in 2005. The team has grown, but at the core are the same people as before. The team is distributed, as always. The optimizer has been largely rewritten by a team in Norway that was added to the MySQL team when the company was acquired by Sun. NoSQL features have been added.

NoSQL and SQL

Have a look at this video of Brian Aker's great 10 minute lightning talk about NoSQL. NoSQL is a database movement which promotes non-relational data stores that do not need a fixed schema.

With the exception of Memcached, the Drupal world is very SQL driven -- and for all the right reasons. However, both at Mollom and Acquia we have to deal with some big data volumes and have been using some of the NoSQL technologies like Hadoop, Thrift, Hbase, Tokio Cabinet, etc.

While these are great tools, at the end of the day, NoSQL is an "and" and not a "versus". Plus, I expect the gap to close as there are a couple of interesting projects under way that bring some of the NoSQL advantages to the SQL world. One of them is Brian Aker's own Drizzle project, another project I can't talk about yet ... Sufficient to say, don't ditch your relation database just yet.

Either way, have a look Brian's NoSQL presentation. It's funny!

Drupal in the cloud

It is not always easy to scale Drupal -- not because Drupal sucks, but simply because scaling the LAMP stack (including Drupal) takes no small amount of skill. You need to buy the right hardware, install load balancers, setup MySQL servers in master-slave mode, setup static file servers, setup web servers, get PHP working with an opcode cacher, tie in a distributed memory object caching system like memcached, integrate with a content delivery network, watch security advisories for every component in your system and configure and tune the hell out of everything.

Either you can do all of the above yourself, or you outsource it to a company that knows how to do this for you. Both are non-trivial and I can count the number of truly qualified companies on one hand. Tag1 Consulting is one of the few Drupal companies that excel at this, in case you're wondering.

My experience is that MySQL takes the most skill and effort to scale. While proxy-based solutions like MySQL Proxy look promising, I don't see strong signals about it becoming fundamentally easier for mere mortals to scale MySQL.

It is not unlikely that in the future, scaling a Drupal site is done using a radically different model. Amazon EC2, Google App Engine and even Sun Caroline are examples of the hosting revolution that is ahead of us. What is interesting is how these systems already seem to evolve: Amazon EC2 allows you to launch any number of servers but you are pretty much on your own to take advantage of them. Like, you still have to pick the operating system, install and configure MySQL, Apache, PHP and Drupal. Not to mention the fact that you don't have access to a good persistent storage mechanism. No, Amazon S3 doesn't qualify, and yes, they are working to fix this by adding Elastic IP addresses and Availability Zones. Either way, Amazon doesn't make it easier to scale Drupal. Frankly, all it does is making capacity planning a bit easier ...

Then comes along Amazon SimpleDB, Google App Engine and Sun Caroline. Just like Amazon EC2/S3 they provide instant scalability, only they moved things up the stack a level. They provide a managed application environment on top of a managed hosting environment. Google App Engine provides APIs that allow you to do user management, e-mail communication, persistent storage, etc. You no longer have to worry about server management or all of the scale-out configuration. Sun Caroline seems to be positioned somewhere in the middle -- they provide APIs to provision lower level concepts such as processes, disk, network, etc.

Unfortunately for Drupal, Google App Engine is Python-only, but more importantly, a lot of the concepts and APIs don't map onto Drupal. Also, the more I dabble with tools like Hadoop (MapReduce) and CouchDB, the more excited I get, but the more it feels like everything that we do to scale the LAMP stack is suddenly wrong. I'm trying hard to think beyond the relational database model, but I can't figure out how to map Drupal onto this completely different paradigm.

So while the center of gravity may be shifting, I've decided to keep an eye on Amazon's EC2/S3 and Sun's Caroline as they are "relational database friendly". Tools like Elastra are showing a lot of promise. Elastra claims to be the world's first infinitely scalable solution for running standard relational databases in an on-demand computing cloud. If they deliver what they promise, we can instantly scale Drupal without having to embrace a different computing model and without having to do all of the heavy lifting. Specifically exciting is the fact that Elastra teamed up with EnterpriseDB to make their version of PostgreSQL virtually expand across multiple Amazon EC2 nodes. I've already reached out to Elastra, EnterpriseDB and Sun to keep tabs on what is happening.

Hopefully, companies like Elastra, EnterpriseDB, Amazon and Sun will move fast because I can't wait to see relational databases live in the cloud ...

Database replication lag

As explained in an earlier blog post, we recently started using MySQL master-slave replication on drupal.org in order to provide the scalability necessary to accommodate our growing demands. With one or more replicas of our database, we can instruct Drupal to distribute or load balance the SQL workload among different database servers.

MySQL's master-slave replication is an asynchronous replication model. Typically, all the mutator queries (like INSERT, UPDATE, DELETE) go to a single master, and the master propagates all updates to the slave servers without synchronization or communication. While the asynchronous nature has its advantages, it is also means that the slaves might be (slightly) out of sync.

Consider the following pseudo-code:

$nid = node_save($data);
$node = node_load($nid);

Because node_save() executes a mutator query (an INSERT or UPDATE statement) is has to be executed on the master, so the master can propagate the changes to the slaves. Because node_load() uses a read-only query, it can go to the master or any of the available slaves. Because of the lack of synchronization between master and slaves, there is one obvious caveat: when we execute node_load() the slaves might not have been updated. In other words, unless we force node_load() to query the master, we risk not being able to present the visitor the data that he just saved. In other cases, we risk introducing data inconsistencies due to the race conditions.

So what is the best way to fix this?

  1. Our current solution on drupal.org is to execute all queries on the master, except for those that we know can't introduce race conditions. In our running example, this means that we'd chose to execute all node_load()s on the master, even in absence of a node_save(). This limits our scalability so this is nothing but a temporary solution until we have a good solution in place.
  2. One way to fix this is to switch to a synchronous replication model. In such a model, all database changes will be synchronized across all servers to ensure that all replicas are in a consistent state. MySQL provides a synchronous replication model through the NDB cluster storage engine. Stability issues aside, MySQL's cluster technology works best when you avoid JOINs and sub-queries. Because Drupal is highly relational, we might have to rewrite large parts of our code base to get the most out of it.
  3. Replication and load balancing can be left to some of the available proxy layers, most notably Continuent's Sequoia and MySQL Proxy. Drupal connects to the proxy as if it was the actual database, and the proxy talks to the underlying databases. The proxy parses all the queries and propagates mutator queries to all the underlying databases to make sure they remain in a consistent state. Reads are only distributed among the servers that are up-to-date. This solution is transparent to Drupal, and should work with older versions of Drupal. The only downside is that it not trivial to setup, and based on my testing, it requires quite a bit of memory.
  4. We could use database partitioning, and assign data to different shards in one way or another. This would reduce the replica lag to zero but as we don't have that much data or database tables with millions of rows, I don't think partitioning will buy drupal.org much.
  5. Another solution is to rewrite large parts of Drupal so it is "replication lag"-aware. In its most naive form, the node_load() function in our running example would get a second parameter that specifies whether the query should be executed on the master or not. The call should then be changed to node_load($nid, TRUE) when proceeded by a node_save().
    I already concluded through research that this is not commonly done; probably because such a solution still doesn't provide any guarantees across page request.
  6. A notable exception is MediaWiki, the software behind Wikipedia which has some documented best practices to deal with replication lag. Specifically, they recommend to query the master (using a very fast query) to see what version of the data they have to retrieve from the slave. If the specified version is not yet available on the slave due to replication lag, they simply wait for it to become available. In our running example, each node should get a version number and node_load() would first retrieve the latest version number from the master and then use that version number to make sure it gets an up-to-date copy from the slave. If the right version isn't yet available, node_load() will try loading the data again until it becomes available.

Pages

Updates from Dries straight to your mailbox