Drupal site crawler project

For years now, people have been asking me how many Drupal sites there are. This is, of course, something of a moving target, and I figured that the only way to answer the question was to count all the Drupal sites out there one by one. Three years ago, I was finally motivated enough to write a Drupal site crawler that scans the millions of websites online to find those powered by Drupal. The crawler initially ran for about 3 months, and it returned a lot of Drupal sites. For each site it found, I then did some data mining to determine where its servers were hosted. I plotted all the Drupal sites as a heatmap on a map of the world, and was able to start tracking Drupal's geographical growth patterns over time. As a bonus, the crawler also counted Joomla!, Mambo, and WordPress sites, as well as a number of other open source content management systems, so it yields good data on both Drupal and the competition.

Writing that crawler was a really fun project. It taught me a ton about how Drupal is used, where Drupal was growing, and how we compared to other content management systems. On the technical side, I learned a lot about scalability. Thanks to my engineering background and my work on Drupal, scalability issues weren't new to me, but writing a crawler that processes information from billions of pages on the web is a whole different ballpark. At various stages of the project, as the index grew, some of the crawler's essential algorithms and data structures got seriously bogged down. I made a lot of trade-offs between scalability, performance and resource usage. In addition to the scalability issues, you also have to learn to deal with massive sites, dynamic pages, wildcard domain names, rate limiting and politeness, cycles in the page/site graph, discovery of new sites, etc.
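To make two of those concerns concrete, here is a minimal sketch of a crawl frontier that handles cycles in the page graph (by remembering which URLs it has already seen) and politeness (by spacing out requests to the same host). It is an illustration only, not the crawler's actual code: the class name `PoliteFrontier`, its methods and the fixed per-host delay are all assumptions made for this example.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

/** Hypothetical crawl frontier: deduplicates URLs and rate-limits requests per host. */
class PoliteFrontier {
    private final long minDelayMillis;                      // assumed fixed gap between hits on one host
    private final Set<String> seen = new HashSet<>();       // canonical URLs already queued
    private final Queue<URI> queue = new ArrayDeque<>();    // URLs waiting to be fetched
    private final Map<String, Long> nextAllowed = new HashMap<>(); // host -> earliest next fetch time

    PoliteFrontier(long minDelayMillis) {
        this.minDelayMillis = minDelayMillis;
    }

    /** Canonical form: lowercase host, empty path treated as "/", fragment dropped. */
    private static String canonical(URI uri) {
        String path = uri.getRawPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }
        String query = uri.getRawQuery() == null ? "" : "?" + uri.getRawQuery();
        return uri.getScheme() + "://" + uri.getHost().toLowerCase(Locale.ROOT) + path + query;
    }

    /**
     * Queues the URL if it has not been seen before. Returning false for a
     * known URL is what keeps cycles in the link graph from looping forever.
     * Throws IllegalArgumentException if the string is not a valid URI.
     */
    boolean offer(String url) {
        URI uri = URI.create(url).normalize();
        if (uri.getScheme() == null || uri.getHost() == null) {
            return false; // relative or host-less URLs are ignored in this sketch
        }
        if (!seen.add(canonical(uri))) {
            return false;
        }
        queue.add(uri);
        return true;
    }

    /**
     * Returns the next URL whose host may be contacted at nowMillis, or null
     * if every queued host is still cooling down (or the queue is empty).
     */
    URI poll(long nowMillis) {
        int pending = queue.size();
        for (int i = 0; i < pending; i++) {
            URI uri = queue.poll();
            String host = uri.getHost().toLowerCase(Locale.ROOT);
            long allowedAt = nextAllowed.getOrDefault(host, 0L);
            if (nowMillis >= allowedAt) {
                nextAllowed.put(host, nowMillis + minDelayMillis);
                return uri;
            }
            queue.add(uri); // host is cooling down; move the URL to the back
        }
        return null;
    }
}
```

The in-memory `HashSet` and `HashMap` are the kind of data structures that stop working once the index no longer fits on one machine, which is where the trade-offs between scalability, performance and resource usage come in. A real crawler would also need to honor `robots.txt` and adapt its delay to how slowly a site responds, neither of which this sketch attempts.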

Even though the crawler ran for many months, I never really launched it or talked about it publicly. I personally lacked the resources (both my time and the money to run the servers) to keep it running all the time. I ultimately stopped the crawler altogether and put the project on hold. Since then, a couple of things have changed. I learned a lot more about scalability thanks to my ongoing work on Drupal and on Mollom, which now processes hundreds of thousands of spam messages a day. I also co-founded Acquia along the way. Acquia shares my personal interest in tracking Drupal's growth and has the resources to help me revive my crawler. Last but not least, there has been a lot of innovation and knowledge sharing around how to solve scalability problems. As such, I'd now like to pick up the project where I left off 3 years ago, and relaunch it under the Acquia umbrella.

One thing hasn't changed: the day remains +/- 24 hours long. I still don't have enough time to work on this, and I have many priorities I didn't have 3 years ago. That is why I'm looking for a summer intern who wants to take the lead on this project for a few months. The crawler is written in Java, so I'm looking for a student who is proficient in Java and who also understands fundamental internet protocols and standards such as DNS, HTTP and HTML. Experience with building scalable multi-server systems is a plus but not strictly required. I don't care where you live, but I only want to work with people who work hard and who have strong programming skills. Are you interested in being my intern for the summer? Contact me through my contact form and send me a copy of your resume. Or if you know someone who would be a good candidate, please point them to this blog post.

Comments

NonProfit (not verified):

This is a great project. I can't wait to read about your findings.

May 11, 2009
Itkovian (not verified):

I recall reading posts in several places from people who had been visited by your crawler, checking out where it came from and discovering Drupal along the way. Sadly, most people who checked you out did not enjoy your search back then, so perhaps this is something to keep in mind?

May 11, 2009
mattie (not verified):

I remember the same :) although I don't really understand why one would be really upset about this..

May 27, 2009
Wim Leers (not verified):

Wow, very interesting project to work on! :)

May 11, 2009
Dave Reid (not verified):

I always thought it would be neat if search engines could expose this information as well. Something like "Dries sitetype:Drupal" in Google would be interesting.

May 11, 2009
Tj Holowaychuk (not verified):

The Java part explains some of the performance issues haha.. Interesting project though, I am sure with reasonably beefy hardware this could be done quite fast. What techniques are you using to "discover" if a site is using Drupal or not? I am working on a lightning fast C DOM parser that might be useful

May 12, 2009
Andy Forbes (not verified):

Dries:

What with backward compatibility not being a priority in major Drupal releases, something like this would be extremely useful for business development for Drupal consulting shops - you may want to keep this in mind as you decide how to share what you find.

Also, through late last year I was responsible for a large Drupal site that was wrestling with significant performance problems - so much so that we contacted the major search engines to get them to throttle back how much they spidered our site. You might want to consider taking page load time into consideration when you develop your code, and make sure your tool doesn't hammer sites that are already slow. You noted this issue in your post, but when I was struggling with performance problems it would have been especially frustrating to have an Acquia-based crawler slowing my site.

Finally, and I suspect that you have already thought of this, but the folks at Google seem to like Drupal - is there any chance they'll give you access to their collected data so you can shortcut the whole crawler / spider development process and go straight to analysis?

Andy

May 12, 2009
Greg Simkins (not verified):

I set up a Drupal site 3 months ago and did nothing with it until now. I found three people had created accounts on my site. I was a bit alarmed and searched for how to locate Drupal sites using Google and found this blog. We shut off the ability to create accounts and learned a lesson. But it shows a small downside to this project.

December 13, 2009
Jonathan Pugh (not verified):

Hi Dries!
I was looking for a link referring to the results you announced at DrupalConSF about Drupal powering more than 1% of the web... the only thing I found were people tweeting it, and the slide.

You may already be working on it, but do you have anything you can post about this amazing statistic? I'd like to talk about and link to it but all I can find is the slides of the presentation.

Thanks!
Jon

May 19, 2010

© 1999-2014 Dries Buytaert Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License.
Drupal is a Registered Trademark of Dries Buytaert.