One million spam attempts blocked

Last weekend, just 3 weeks after we launched Mollom, Mollom has blocked the one millionth spam attempt. That is a million tiny contributions to make the web a nicer place. Incidentally, Mollom also got mentioned on Techcrunch that same weekend. Milestone weekend!

BussinessWeek's Young Entrepreneurs of Tech

It is a real honor and privilege that BusinessWeek included me on their list of top 30-and-under innovators for 2008. I'm very happy to see Drupal getting this level of recognition from the business world. What BusinessWeek only hints at, though, is the importance of all the thousands of people, including my new colleagues at Acquia, who are working every day to make Drupal great. This is a big milestone for Drupal so congratulations to all of you.
Businessweek techs best young entrepreneurs

Website spam and moderation queues

Mollom is a web service that blocks website spam. Websites using Mollom send data they want checked to, and Mollom replies with either a spam or ham classification. If Mollom is not certain, it will return an unsure classification, typically prompting websites to ask Mollom's CAPTCHA server for an audio or visual CAPTCHA challenge to present to the user. In other words, Mollom uses a classifier with three states: ham, spam and unsure. We explained that in detail on the "How Mollom works" page.

Over at the Mollom blog, Ben wrote a great post about why we believe this is a key difference, and how that allows Mollom to eliminate your moderation queue. A picture is worth more than a thousand words, so check out the plots below and check out Ben's blog post for more details.

Spam versus ham

The plot illustrates that having a binary classifier with only two states (ham and spam) is bound to make mistakes. The plot is generated from the actual data in Mollom's database.

As you can see on the first graph, a binary classifier with two states (ham, spam) is never going to be deadly accurate, and will require a moderation queue so the user can manually deal with legitimate comments that incorrectly got classified as spam. Unfortunately, moderation queues are not fun, and they don't make you any more productive. You'll still find yourself wading through thousands of spam comments looking for ham. In other words, a moderation queue doesn't really solve the problem -- it just makes the problem look different.

Spam versus ham

The plot illustrates that having a classifier with three states avoids false positives and false negatives. The plot is generated from the actual data in Mollom's database.

Time for something better. As you can see on the second graph, a classifier with three states is going to be a lot more accurate. In fact, Mollom is so accurate that the Drupal module doesn't come with a moderation queue! It an important distinction, and one of the many innovations that we have in store for you. Bye bye moderation queue!

Drupal in the cloud

It is not always easy to scale Drupal -- not because Drupal sucks, but simply because scaling the LAMP stack (including Drupal) takes no small amount of skill. You need to buy the right hardware, install load balancers, setup MySQL servers in master-slave mode, setup static file servers, setup web servers, get PHP working with an opcode cacher, tie in a distributed memory object caching system like memcached, integrate with a content delivery network, watch security advisories for every component in your system and configure and tune the hell out of everything.

Either you can do all of the above yourself, or you outsource it to a company that knows how to do this for you. Both are non-trivial and I can count the number of truly qualified companies on one hand. Tag1 Consulting is one of the few Drupal companies that excel at this, in case you're wondering.

My experience is that MySQL takes the most skill and effort to scale. While proxy-based solutions like MySQL Proxy look promising, I don't see strong signals about it becoming fundamentally easier for mere mortals to scale MySQL.

It is not unlikely that in the future, scaling a Drupal site is done using a radically different model. Amazon EC2, Google App Engine and even Sun Caroline are examples of the hosting revolution that is ahead of us. What is interesting is how these systems already seem to evolve: Amazon EC2 allows you to launch any number of servers but you are pretty much on your own to take advantage of them. Like, you still have to pick the operating system, install and configure MySQL, Apache, PHP and Drupal. Not to mention the fact that you don't have access to a good persistent storage mechanism. No, Amazon S3 doesn't qualify, and yes, they are working to fix this by adding Elastic IP addresses and Availability Zones. Either way, Amazon doesn't make it easier to scale Drupal. Frankly, all it does is making capacity planning a bit easier ...

Then comes along Amazon SimpleDB, Google App Engine and Sun Caroline. Just like Amazon EC2/S3 they provide instant scalability, only they moved things up the stack a level. They provide a managed application environment on top of a managed hosting environment. Google App Engine provides APIs that allow you to do user management, e-mail communication, persistent storage, etc. You no longer have to worry about server management or all of the scale-out configuration. Sun Caroline seems to be positioned somewhere in the middle -- they provide APIs to provision lower level concepts such as processes, disk, network, etc.

Unfortunately for Drupal, Google App Engine is Python-only, but more importantly, a lot of the concepts and APIs don't map onto Drupal. Also, the more I dabble with tools like Hadoop (MapReduce) and CouchDB, the more excited I get, but the more it feels like everything that we do to scale the LAMP stack is suddenly wrong. I'm trying hard to think beyond the relational database model, but I can't figure out how to map Drupal onto this completely different paradigm.

So while the center of gravity may be shifting, I've decided to keep an eye on Amazon's EC2/S3 and Sun's Caroline as they are "relational database friendly". Tools like Elastra are showing a lot of promise. Elastra claims to be the world's first infinitely scalable solution for running standard relational databases in an on-demand computing cloud. If they deliver what they promise, we can instantly scale Drupal without having to embrace a different computing model and without having to do all of the heavy lifting. Specifically exciting is the fact that Elastra teamed up with EnterpriseDB to make their version of PostgreSQL virtually expand across multiple Amazon EC2 nodes. I've already reached out to Elastra, EnterpriseDB and Sun to keep tabs on what is happening.

Hopefully, companies like Elastra, EnterpriseDB, Amazon and Sun will move fast because I can't wait to see relational databases live in the cloud ...

Pink using Drupal


Pink is using Drupal for her official website. She also began using Mollom to protect her site's content, and we've blocked thousands of spam attempts since. The site was built by her label Sony BMG. Yay!

Call me soft, but my favorite Pink song is "I have seen the rain", a song written by her father while he was stationed in Vietnam. You can watch the YouTube video below ...


© 1999-2015 Dries Buytaert Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License.
Drupal is a Registered Trademark of Dries Buytaert.