Looking at the code behind our three uses of Apache Hadoop

By Dhruba Borthakur on Friday, December 10, 2010 at 9:41am

The size of the data warehouse cluster at Facebook has been increasing tremendously over the past few years. We use several pieces of open source software in our data warehouse including Apache Hadoop, Apache Hive, Apache HBase, Apache Thrift and Facebook Scribe. Together they keep this data processing engine humming.

 

We’ve now open sourced the exact versions of Apache Hadoop we run in production at Facebook. These branches are based on the Apache Hadoop 0.20 release, but have a number of new features which haven’t yet made it into any official Hadoop release. Our engineering team contributes changes (such as the patches in these branches) upstream to the Hadoop project and you’ll find that we’ve listed JIRA tasks along with most of the changes.

 

Apache Hadoop is being used in three broad types of systems: as a warehouse for web analytics, as storage for a distributed database and for MySQL database backups.

 

Data warehousing clusters are largely batch processing systems where the emphasis is on scalability and throughput. We have enhanced the NameNode’s locking model to scale to 30 petabytes, possibly the largest single Hadoop cluster in the world. We have multiple clusters at separate data centers, and our largest warehouse cluster currently spans three thousand machines. Our jobs scan around 2 petabytes per day in this warehouse, and more than 300 people throughout the company query it every month. Our primary bottleneck is raw disk capacity, closely followed by CPU during peak load times. The code that runs on these types of clusters is at http://github.com/facebook/hadoop-20-warehouse.

 

Realtime application serving is a more recent use of Apache Hadoop for us. We have enhanced Apache Hadoop to lower latencies and shorten response times. One of our newest applications, Messages, uses a distributed database called Apache HBase to store data in Apache Hadoop clusters. Another use case is collecting click logs in near real time from our web servers and streaming them directly into our Hadoop clusters. We use Scribe to make this easier; it is well integrated with Apache Hadoop and can store data directly into it. Most of these realtime clusters are smaller, around a hundred nodes each. The code that runs on these clusters is at http://github.com/facebook/hadoop-20-append.

 

The third use case is medium-term archiving of MySQL databases. This solution provides fast backup and recovery from data stored in the Hadoop File System, and it reduces our maintenance and deployment costs for archiving petabyte-size datasets.

 

What rockstars do we have in our code base?

1) AvatarNode

Both the warehouse codebase and the realtime codebase contain the Hadoop High Availability AvatarNode, which provides an always-up Hadoop Distributed File System (HDFS). The AvatarNode is a software wrapper around the Hadoop NameNode. There are two AvatarNodes in a single HDFS cluster; one acts as the primary and the other as a hot standby. The primary AvatarNode and the standby AvatarNode are coordinated via Apache ZooKeeper. The administrator can toggle the avatar of each of these nodes, thereby allowing us to deploy new software to the cluster without incurring downtime. The time to failover is a few seconds for large clusters. This patch has been contributed back to Apache Hadoop in HDFS-976.
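To make the primary/standby toggle concrete, here is a minimal Python sketch of the idea. It is a toy model, not the AvatarNode code: the class names, the node names and the plain dict standing in for whatever the nodes record in ZooKeeper are all assumptions made for illustration.

```python
from enum import Enum


class Avatar(Enum):
    PRIMARY = "primary"
    STANDBY = "standby"


class AvatarNodeSketch:
    """Toy stand-in for one AvatarNode (hypothetical, not the real Hadoop class)."""

    def __init__(self, name, avatar):
        self.name = name
        self.avatar = avatar


class ClusterSketch:
    """Two nodes plus a dict that stands in for the coordination service."""

    def __init__(self, node_a, node_b):
        self.nodes = [node_a, node_b]
        # Assumption: the coordinator only needs to know who the primary is.
        self.registry = {"primary": self._find(Avatar.PRIMARY).name}

    def _find(self, avatar):
        matches = [n for n in self.nodes if n.avatar is avatar]
        if len(matches) != 1:
            raise RuntimeError(f"expected exactly one {avatar.value} node")
        return matches[0]

    def failover(self):
        """Swap the two avatars and point the registry at the new primary."""
        old_primary = self._find(Avatar.PRIMARY)
        new_primary = self._find(Avatar.STANDBY)
        old_primary.avatar = Avatar.STANDBY
        new_primary.avatar = Avatar.PRIMARY
        self.registry["primary"] = new_primary.name
        return new_primary


cluster = ClusterSketch(
    AvatarNodeSketch("node0", Avatar.PRIMARY),
    AvatarNodeSketch("node1", Avatar.STANDBY),
)
promoted = cluster.failover()
print(promoted.name, cluster.registry)
```

In a software upgrade, the former primary could be upgraded while it sits in the standby role and then toggled back, which is the kind of no-downtime deployment described above.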

 

2) RaidNode

The warehouse codebase has HDFS-RAID, which uses parity files to decrease storage usage. Disk is cheap, but 30 petabytes of storage is quite costly! Hadoop, by default, keeps three copies of data. HDFS-RAID creates parity blocks from data blocks, allowing us to store fewer copies of data while maintaining the same probability of data loss. We have deployed the XOR-parity generator, saving about 5 petabytes of raw storage in our production deployment. We are actively developing a Reed-Solomon variant for generating parity blocks, which will allow us to store even fewer copies of data. We have contributed this feature to Apache Hadoop and you can find it in the Hadoop MapReduce sub-project.
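The XOR-parity idea can be sketched in a few lines of Python. This is an illustration of the general technique only, not the HDFS-RAID implementation: the function names and the tiny three-block stripe are hypothetical, and real HDFS blocks are far larger.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equally sized blocks."""
    if not blocks:
        raise ValueError("need at least one block")
    size = len(blocks[0])
    if any(len(b) != size for b in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def recover_block(surviving_blocks, parity):
    """Rebuild the single missing data block from the survivors and the parity."""
    return xor_blocks(list(surviving_blocks) + [parity])


# Hypothetical stripe of three small data blocks.
stripe = [b"block-A!", b"block-B!", b"block-C!"]
parity = xor_blocks(stripe)

# Pretend the second block was lost, then rebuild it from the rest.
rebuilt = recover_block([stripe[0], stripe[2]], parity)
print(rebuilt == stripe[1])
```

One XOR parity block can recover only one missing block per stripe, which is why a Reed-Solomon code, which tolerates more losses per stripe, permits storing even fewer copies.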

 

3) Appends

The realtime codebase has full support for reliable HDFS file-append operations, which are a basic requirement for Apache HBase to be able to store indexes reliably on HDFS. Earlier versions of Apache HBase were susceptible to data loss because HDFS did not have an API to sync data to stable storage. We have added a new interface to Apache Hadoop 0.20 to reliably sync data to stable storage. This is the first known instance of Apache Hadoop in a production setting that supports zero data loss while using Apache HBase. We have contributed this feature to the 0.20-append branch of Apache Hadoop.
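Why a sync call matters for a database log can be shown with a local-file analogy. The Python sketch below uses the operating system's own `os.fsync` on an ordinary file; it does not use or reproduce the HDFS interface, and the file name and record format are made up for the example.

```python
import os
import tempfile


def append_and_sync(path, record):
    """Append one record and return only after asking the OS to persist it."""
    with open(path, "ab") as log:
        log.write(record + b"\n")
        log.flush()             # move data from Python's buffer to the OS
        os.fsync(log.fileno())  # ask the OS to write it to the storage device


def replay(path):
    """Read back every record, as a recovering server would after a crash."""
    with open(path, "rb") as log:
        return [line.rstrip(b"\n") for line in log]


# Hypothetical write-ahead log in a temporary directory.
log_path = os.path.join(tempfile.mkdtemp(), "edits.log")
append_and_sync(log_path, b"put row1 col=a")
append_and_sync(log_path, b"put row2 col=b")
recovered = replay(log_path)
print(recovered)
```

Without the sync step, a record that the database has already acknowledged may still sit only in a buffer when a machine fails, which is the kind of data loss the append work is meant to rule out.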

 

What is in the pipeline?

One of our current challenges is that the physical limits of a data center are starting to cap the size of a single cluster. We are working on a solution to make Apache Hadoop functional across data centers using HighTide (see HDFS-1432). Another challenge is to increase hardware utilization in a heterogeneous environment. We are working to make the job scheduler resource aware and to move towards slotless map-reduce. We are also working on supporting snapshots on HDFS, and we are developing techniques to ship new code to the JobTracker without incurring any downtime.
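As a rough illustration of what "resource aware" could mean in contrast to fixed slots, the Python sketch below admits a task to a node only if its declared CPU and memory demands fit what is currently free. Everything here is an assumption for the sake of the example (the class names, the two resource dimensions, the numbers); it does not describe the scheduler design we are building.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    cpu_cores: float
    memory_gb: float


@dataclass
class Node:
    cpu_cores: float
    memory_gb: float
    running: list = field(default_factory=list)

    def free(self):
        """Return (free CPU cores, free memory in GB) on this node."""
        used_cpu = sum(t.cpu_cores for t in self.running)
        used_mem = sum(t.memory_gb for t in self.running)
        return self.cpu_cores - used_cpu, self.memory_gb - used_mem

    def try_admit(self, task):
        """Admit the task only if both of its demands fit the free resources."""
        free_cpu, free_mem = self.free()
        if task.cpu_cores <= free_cpu and task.memory_gb <= free_mem:
            self.running.append(task)
            return True
        return False


# Hypothetical node and tasks: sizes are chosen only to show one rejection.
node = Node(cpu_cores=8, memory_gb=16)
admitted = [
    node.try_admit(Task("light-map", 1, 1)),
    node.try_admit(Task("heavy-reduce", 2, 12)),
    node.try_admit(Task("another-heavy", 2, 12)),
]
print(admitted)
```

With a fixed number of slots per machine, a large and a small machine with the same slot count would be treated alike; admitting by measured demand lets a heterogeneous fleet be packed according to what each node can actually hold.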

 

We will update the JIRAs and GitHub repositories with these features and many more in the next few months. You’ll also find Yahoo!’s tree of Hadoop source code at https://github.com/yahoo/hadoop-common. Stay tuned!

 

Dhruba is an engineer on our data infrastructure team.