Building Realtime Insights
Social plugins have become an important and growing source of traffic for millions of websites over the past year. We released a new version of Insights for Websites last week to give site owners better analytics on how people interact with their content and to help them optimize their websites in real time.
To accomplish this, we had to engineer a system that could process over 20 billion events per day (200,000 events per second) with a lag of less than 30 seconds. This system had to be able to process many different types of events, and do so in a way that accounted for an uneven distribution of keys. For example, Charlie Sheen articles are currently generating far more Likes and impressions on Facebook than my personal website does, and there are far more URLs per domain on a site like espn.com than there are on my personal blog.
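One common way to cope with this kind of key skew is to salt hot row keys so that increments for a single popular URL spread across several rows, then sum the buckets back together at read time. The sketch below illustrates the idea in plain Python; the bucket count, key layout, and `salted_key` helper are illustrative assumptions, not our actual schema.

```python
from collections import Counter

NUM_BUCKETS = 16  # hypothetical shard count, chosen for illustration

def salted_key(url: str, salt: int) -> str:
    # Prefix a deterministic salt so a hot URL's increments land on
    # NUM_BUCKETS distinct row keys instead of one hot row.
    return f"{salt % NUM_BUCKETS:02d}:{url}"

# Writers rotate through salts (e.g. by event id or writer id).
counts = Counter()
for event_id in range(1000):
    counts[salted_key("espn.com/some-article", event_id)] += 1

# Readers enumerate every bucket for the URL and sum the counts.
total = sum(
    counts[f"{s:02d}:espn.com/some-article"] for s in range(NUM_BUCKETS)
)
```

The trade-off is a fan-out of NUM_BUCKETS reads per lookup in exchange for writes that no longer serialize on a single row.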
We tested a few different architectures before settling on the one we shipped. MySQL counters could not handle the write rate even with creative write-batching schemes; in-memory counters did not meet our reliability requirements; and MapReduce, though extensible, introduced too much latency and unpredictable contention when multiple jobs waited on the same table or query.
The Insights for Websites that we launched is based on storage in HBase, the distributed, column-oriented Hadoop database; instrumentation that flows through a log file system built atop HDFS, known internally as Scribe; and a client that tails, batches, and writes data streams out of Scribe and into HBase. We chose HBase for its ability to sustain a very high write rate with high reliability. The Write Ahead Log of HBase is key to enabling this.
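The tail-batch-write pattern can be sketched as follows. This is a minimal, self-contained illustration of the batching step only: the `BatchingTailer` class, its flush threshold, and the in-memory `store` standing in for HBase are all assumptions for the example, not our production client.

```python
from collections import Counter

class BatchingTailer:
    """Hypothetical sketch of a log-tailing client: accumulate per-key
    increments in memory and flush them as one batched write, trading
    a few seconds of latency for far fewer writes to the store."""

    def __init__(self, flush_threshold: int = 1000):
        self.flush_threshold = flush_threshold
        self.pending = Counter()   # increments not yet written
        self.store = Counter()     # stands in for HBase counters

    def handle_event(self, key: str, delta: int = 1) -> None:
        # Called once per tailed log line, e.g. key = "like:espn.com".
        self.pending[key] += delta
        if sum(self.pending.values()) >= self.flush_threshold:
            self.flush()

    def flush(self) -> None:
        # One batched write per flush; in HBase these would be atomic
        # column increments, made durable by the Write Ahead Log.
        for key, delta in self.pending.items():
            self.store[key] += delta
        self.pending.clear()
```

Because increments for the same key collapse into a single pending entry between flushes, a burst of events on a hot key costs one write per flush interval rather than one write per event.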
There are many other details to the architectural design, such as table schema, key composition, nuances of batching, and sharding. You can hear more about it in our first Seattle tech talk, and look for future engineering blog posts.