Moving an Elephant: Large Scale Hadoop Data Migration at Facebook

By Paul Yang on Wednesday, July 27, 2011 at 9:19am

Users share billions of pieces of content daily on Facebook, and it’s the data infrastructure team's job to analyze that data so we can present it to those users and their friends in the quickest and most relevant manner. This requires a lot of infrastructure and supporting data, so much so that we need to move that data periodically to ever larger data centers. Just last month, the data infrastructure team finished our largest data migration ever – moving dozens of petabytes of data from one data center to another.

 

During the past two years, the number of shared items has grown exponentially, and the corresponding requirements for the analytics data warehouse have increased as well. As the majority of the analytics is performed with Hive, we store the data on HDFS — the Hadoop distributed file system.  In 2010, Facebook had the largest Hadoop cluster in the world, with over 20 PB of storage. By March 2011, the cluster had grown to 30 PB — that’s 3,000 times the size of the Library of Congress! At that point, we had run out of power and space to add more nodes, necessitating the move to a larger data center. 

 

Move the Boxes or Mirror the Data?

The scale of this migration exceeded all previous ones, and we considered a couple of different migration strategies. One was a physical move of the machines to the new data center. We could have moved all the machines within a few days with enough hands at the job. However, this was not a viable option as our users and analysts depend on the data 24/7, and the downtime would be too long.

 

Another approach was to set up a replication system that mirrors changes from the old cluster to the new, larger cluster. Then at the switchover time, we could simply redirect everything to the new cluster. This approach is more complex as the source is a live file system, with files being created and deleted continuously. Due to the unprecedented cluster size, a new replication system that could handle the load would need to be developed. However, because replication minimizes downtime, it was the approach that we decided to use for this massive migration.

 

Replication It Is

Once the required systems were developed, the replication approach was executed in two steps. First, a bulk copy transferred most of the data from the source cluster to the destination. Most of the directories were copied via DistCp — an application shipped with Hadoop that uses a MapReduce job to copy files in parallel. Our Hadoop engineers made code and configuration changes to handle special cases with Facebook's dataset, including the ability for multiple mappers to copy a single large file, and for the proper handling of directories with many small files. After the bulk copy was done, file changes after the start of the bulk copy were copied over to the destination cluster through the new replication system. File changes were detected through a custom Hive plug-in that recorded the changes to an audit log. The replication system continuously polled the audit log and copied modified files so that the destination would never be more than a couple of hours behind. The plug-in recorded Hive metadata changes as well, so that metadata modifications such as the last accessed time of Hive tables and partitions were propagated. Both the plug-in and the replication system were developed in-house by members of the Hive team.

 

At the final migration switchover time, we set up camp in a war room and shut down Hadoop JobTracker so that new files would not be created. Then, the replication system was allowed to catch up. During this time, user directories were copied over as well. Once replication was caught up, both clusters were identical, and we changed the DNS entries so that the hostnames referenced by Hadoop jobs pointed to the servers in the new cluster. We started the JobTracker in the new data center, and the jobs were able to run as usual, with no modifications required. Naturally, we all grabbed drinks after all was said and done.

 

Size Matters

There were many challenges in both the replication and switchover steps. For replication, the challenges were in developing a system that could handle the size of the warehouse. The warehouse had grown to millions of files, directories, and Hive objects. Although we’d previously used a similar replication system for smaller clusters, the rate of object creation meant that the previous system couldn’t keep up. A rewrite of the system with multi-threading and an additional back-end service to handle HDFS operations enabled a sufficiently high object copy rate to complete the migration in time. For the switchover to the new cluster, the challenges stemmed from managing the large number of systems that interacted with the warehouse and the MapReduce cluster. We have several gating systems in place that control job submission and scheduling. For the switchover, we had to shut down those gates and restart them once all the DNS and configuration changes were complete.

 

Move Fast, Replicate Things

A big lesson we learned was that having a fast replication system was invaluable in addressing issues that arose. Corrupt files could easily be remedied with an additional copy without affecting our schedule. If the replication process could just barely keep up with the workload, then any recopy could have resulted in missed deadlines. Also, a fast replication system helped to minimize the downtime during the switchover, as the downtime is mainly dictated by how quickly both clusters can be brought to identical states.

 

As an additional benefit, the replication system also demonstrated a potential disaster-recovery solution for warehouses using Hive. Unlike a traditional warehouse using SAN/NAS storage, HDFS-based warehouses lack built-in data-recovery functionality. We showed that it was possible to efficiently keep an active multi-petabyte cluster properly replicated, with only a small amount of lag. With replication deployed, operations could be switched over to the replica cluster with relatively little work in case of a disaster. The replication system could increase the appeal of using Hadoop for high-reliability enterprise applications.

 

By using the replication approach, the migration was an almost seamless process. Now that we're moved to a bigger cluster, we can continue to improve Facebook and deliver the most relevant content to everyone.

 

The next set of challenges for us include providing an ability to support a data warehouse that is distributed across multiple data centers. If you're interested in working on these and other "petascale" problems related to Hadoop, Hive, or just large systems, come join Facebook's data infrastructure team!

 

The data infrastructure team in the war room during the final switchover.

The data infrastructure team in the war room during the final switchover.