This is a mirror of official site: http://jasper-net.blogspot.com/

Mollom Architecture - Killing Over 373 Million Spams At 100 Requests Per Second

| Wednesday, February 9, 2011
Mollom is one of those cool SaaS companies every developer dreams of creating when they wrack their brains looking for a viable software-as-a-service startup. Mollom profitably runs a useful service—spam filtering—with a small group of geographically distributed developers. Mollom helps protect nearly 40,000 websites from spam, including one of mine, which is where I first learned about Mollom. In a desperate attempt to stop spam on a Drupal site, where every other form of CAPTCHA had failed miserably, I installed Mollom in about 10 minutes and it immediately started working. That's the out of the box experience I was looking for.

From the time Mollom opened it's digital inspection system they've rejected over 373 million spams and in the process they've learned that a stunning 90% of all messages are spam. This spam torrent is handled by only two geographically distributed machines that handle 100 requests/ second, each running a Java application server and Cassandra. So few resources are necessary because they've created a very efficient machine learning system. Isn't that cool? So, how do they do it?

To find out I interviewed Benjamin Schrauwen, cofounder of Mollom, and Johan Vos, Glassfish and Java enterprise expert. Proving software knows no national boundaries, Mollom HQ is located in Belgium  (other good things from Belgium: Hercule Poirot, chocolate, waffles).

Statistics

  • Serving 40,000 active websites, many of which are very large customers like Sony Music, Warner Brothers, Fox News, and The Economist. A lot of big brands, with big websites, and a lot of comments.
  • Find 1/2 million spam messages each day.
  • Handle 100 API calls/second.
  • A spam check is low latency, taking between 30-50msecs. The slowest connection would be 500msec. The 95th percentile of latency is 250msecs. It's really optimized for speed. 
  • Spam classification efficiency is at 99.95%. This means that only 5 in 10,000 spam messages were not caught by Mollom.
  • Netlog, which is a social networking site in Europe, has their own Mollom setup in their own datacenter. Netlog handles about 4 million messages a day on custom classifiers that are trained on their data. 

Platform
  • Two production servers run in two different datacenters for failover. 
  • One server is on the East coast and one is on the West coast. 
  • Each server is an Intel Xeon Quad core, 2.8GHz, 16GB RAM, 4 disks of 300 GB, RAID 10. 
  • SoftLayer - the machines are hosted by SoftLayer.
  • Cassandra - a NoSQL database selected for it's write performance and ability to operate across multiple datacenters. 

Read more: High Scalability
Read more: Mollom

Posted via email from Jasper-net

0 comments: