Scaling Out

by Jason Sobel on Wednesday, August 20, 2008 at 11:05am ·

I joined Facebook in April 2007 and, after getting settled over the course of a few weeks, my manager Robert Johnson approached me. We talked for a while but the conversation boiled down to:

Bobby: "So, Jason, we're going to open a new datacenter in Virginia by 2008. Do you think you can help?"
Me: "Uh.... yes?"
Bobby: "Great!"

My first project at Facebook was a tad more involved then I was expecting, but I think that is one reason why we have such a great engineering organization; we have a lot of hard problems to solve and everyone here is excited to jump in and tackle them. I set out to really understand why we were building a new datacenter and what problems we had to overcome to make it work.

Why Bother?


The primary reason for building a new datacenter on the east coast was latency. It takes about 70 milliseconds to send a packet across the country on a high-speed link, and it can be much longer for an average internet user. By putting servers in Virginia we could reduce the time to send a page to users on the east coast and in Europe by a noticeable amount.

Secondary concerns were space, power, and disaster recovery. We were running out of physical space in our primary datacenters in California and the Virginia site would give us lots of room to grow. We were having a similar problem with getting enough electricity to power all those servers. Finally, restricting ourselves to only one location meant that, in the event of a disaster (power failure, earthquake, Godzilla), Facebook could be unusable for extended periods of time.

Build It!

Before we could go to work on the application level challenges our operations team put in a heroic effort to build out the servers and the physical space in Virginia. They also brought up the intra-datacenter network and the low latency inter-datacenter fiber channel link. This work was an enormous undertaking but our operations team is top-notch and made it all look easy.

With the network and hardware in place we set up our standard 3 tier architecture: web server, memcache server, and MySQL database. The MySQL databses in Virginia were going to run as slaves of the west coast databases, so we spent a couple weeks copying all the data across the country and setting up replication streams.

Now that the hardware, network, and basic infrastructure was set up it was time to face the two main application level challenges: cache consistency and traffic routing.

Cache Consistency



A bit of background on our caching model: when a user modifies a data object our infrastructure will write the new value in to a database and delete the old value from memcache (if it was present). The next time a user requests that data object we pull the result from the database and write it to memcache. Subsequent requests will pull the data from memcache until it expires out of the cache or is deleted by another update.

This setup works really well with only one set of databases because we only delete the value from memcache after the database has confirmed the write of the new value. That way we are guaranteed the next read will get the updated value from the database and put it in to memcache. With a slave database on the east coast, however, the situation got a little tricky.

When we update a west coast master database with some new data there is a replication lag before the new value is properly reflected in the east coast slave database. Normally this replication lag is under a second but in periods of high load it can spike up to 20 seconds.

Now let's say we delete the value from Virginia memcache tier at the time we update the master database in California. A subsequent read from the slave database in Virginia might see the old value instead of the new one because of replication lag. Then Virginia memcache would be updated with the old (incorrect) value and it would be "trapped" there until another delete. As you can see, in the worst case the Virginia memcache tier would always be one "version" behind of the correct data.

Consider the following example:
  1. I update my first name from "Jason" to "Monkey"

  2. We write "Monkey" in to the master database in California and delete my first name from memcache in California and Virginia

  3. Someone goes to my profile in Virginia

  4. We don't find my first name in memcache so we read from the Virginia slave database and get "Jason" because of replication lag

  5. We update Virginia memcache with my first name as "Jason"

  6. Replication catches up and we update the slave database with my first name as "Monkey"

  7. Someone else goes to my profile in Virginia

  8. We find my first name in memcache and return "Jason"



Until I update my first name again or it falls out of cache and we go back to the database, we will show my first name as "Jason" in Virginia and "Monkey" in California. Confusing? You bet. Welcome to the world of distributed systems, where consistency is a really hard problem.

Fortunately, the solution is a lot easier to explain than the problem. We made a small change to MySQL that allows us to tack on extra information in the replication stream that is updating the slave database. We used this feature to append all the data objects that are changing for a given query and then the slave database "sees" these objects and is responsible for deleting the value from cache after it performs the update to the database.

How'd we do it? MySQL uses a lex parser and a yacc grammar to define the structure of a query and then parse it. I've simplified the following for ease of explanation, but at the highest level this grammar looks like:

query:
statement END_OF_INPUT {};

statement:
alter
| analyze
| backup
| call
... (insert, replace, select, etc.)

Pretty straightforward, right? A query is a statement which breaks down to one of the MySQL expressions we all know and love. We modified this grammar to allow appending memcache keys to the end of any query, as follows:

query:
statement mc_dirty END_OF_INPUT {};

mc_dirty:
{}
| MEMCACHE_DIRTY mc_key_list;

mc_key_list:
mc_key_list ',' text_string { Lex->mc_key_list.push_back($3); }
| text_string { Lex->mc_key_list.push_back($1); };

A query now has an additional component; after the statement comes the mc_dirty which is either empty or a keyword MEMCACHE_DIRTY followed by a mc_key_list. A mc_key_list is just a comma-separated list of strings and the rule tells the parser to push all the strings one-by-one on to a vector named mc_key_list which is stored inside a per-query parser object.

As an example, an old query might look like:
REPLACE INTO profile (`first_name`) VALUES ('Monkey') WHERE `user_id`='jsobel'
and under the new grammar it would change to:
REPLACE INTO profile (`first_name`) VALUES ('Monkey') WHERE `user_id`='jsobel' MEMCACHE_DIRTY 'jsobel:first_name'

The new query is telling MySQL that, in addition to changing my first name to Monkey, it also needs to dirty a corresponding memcache key. This is easily implemented. Since the per-query parser object now stores all memcache keys we tack on to a query, we added a small piece of code at the end of mysql_execute_command that dirties those keys if the query is successful. Voila, we've hijacked the MySQL replication stream for our own purpose: cache consistency.

The new workflow becomes (changed items in bold):

  1. I update my first name from "Jason" to "Monkey"

  2. We write "Monkey" in to the master database in California and delete my first name from memcache in California but not Virginia

  3. Someone goes to my profile in Virginia

  4. We find my first name in memcache and return "Jason"

  5. Replication catches up and we update the slave database with my first name as "Monkey." We also delete my first name from Virginia memcache because that cache object showed up in the replication stream

  6. Someone else goes to my profile in Virginia

  7. We don't find my first name in memcache so we read from the slave and get "Monkey"

Page Routing



The other main problem we had to address was that only our master databases in California could accept write operations. This fact meant we needed to avoid serving pages that did database writes from Virginia because each one would have to cross the country to our master databases in California. Fortunately, our most frequently accessed pages (home page, profiles, photo pages) don't do any writes under normal operation. The problem thus boiled down to, when a user makes a request for a page, how do we decide if it is "safe" to send to Virginia or if it must be routed to California?

This question turned out to have a relatively straightforward answer. One of the first servers a user request to Facebook hits is called a load balancer; this machine's primary responsibility is picking a web server to handle the request but it also serves a number of other purposes: protecting against denial of service attacks and multiplexing user connections to name a few. This load balancer has the capability to run in Layer 7 mode where it can examine the URI a user is requesting and make routing decisions based on that information. This feature meant it was easy to tell the load balancer about our "safe" pages and it could decide whether to send the request to Virginia or California based on the page name and the user's location.

There is another wrinkle to this problem, however. Let's say you go to editprofile.php to change your hometown. This page isn't marked as safe so it gets routed to California and you make the change. Then you go to view your profile and, since it is a safe page, we send you to Virginia. Because of the replication lag we mentioned earlier, however, you might not see the change you just made! This experience is very confusing for a user and also leads to double posting. We got around this concern by setting a cookie in your browser with the current time whenever you write something to our databases. The load balancer also looks for that cookie and, if it notices that you wrote something within 20 seconds, will unconditionally send you to California. Then when 20 seconds have passed and we're certain the data has replicated to Virginia, we'll allow you to go back for safe pages.

Looking Back



Nine months after our first user viewed a page in the Virginia datacenter we're still running the same architecture with good success. There were bumps along the way, of course; for the first month or two the cache consistency infrastructure was very shaky and would periodically force us to divert traffic away from Virginia while we diagnosed and fixed bugs. Over time, however, we've ironed out the issues and now serve a substantial portion of Facebook's traffic out of this datacenter.

The main scaling challenge with this architecture is pretty obvious: all write operations must happen in one location. Going forward we're very excited to develop new technologies that will let us perform writes in any location. We're also thinking a lot about how to use our new datacenter as a disaster recovery site in case Godzilla decides to attack our California locations! Interested in helping us out? www.facebook.com/jobs!
· Comment · Share
  • Manikandan Murugavelu, Merry Christina, Himanshu Ghai and 44 others like this.
  • 1 share1 share
    • Mustafa K. Isik
      The 20 second assumption in the cookie would have left me uneasy. How would you go about scaling to more datacenters? Set specific max. replication lags for different datacenters?
      What metric/measurement is the 20 second setting for the Vir...ginia datacenter based on? Is it constant or dynamically updated based on some sort of continuous measurement procedure?

      Very interesting read. Thank you. I know it is easy to sound overly critic when the true challenge and work lies in actually realizing a system of any sort, especially a scalable one.
      See More
      August 21, 2008 at 3:31am
    • Mihai Anca Nice articles, I'd love more technical stuff.
      August 21, 2008 at 4:17am
    • Anand Das Good post, keep 'em coming!
      August 21, 2008 at 5:27am
    • Eric Hankins Great post. Love to hear this sort of stuff.
      August 21, 2008 at 8:52am
    • Syed Zeeshan Rizvi
      Instead of triggering off of a user request to update the db, one way of addressing this problem might be to sync the db's at periodic intervals with only the atomic changes to the db ..This can add to the existing traffic on your internal ...network you might say , well then , provisioning a dedicated mpls circuit just for syncing might be the solution for you ..
      The latency can also addressed by NOT looking at the network latency as a whole but rather profiling the traffic patterns for different geographic regions ( e.g. NY/Tri state area might have more traffic demands vs Little Rock Arkansas) and provisioning Qos parameters accordingly to address those demands accordingly...Tier-ing up the provisioning with highest response time for tier-1 users ( paying fb users in future may be ;-) ) and lowest for tier-n user can also be considered ..
      Bottom line : improving user experience and latency issues is a function of both network AND application tuning and can not be addressed with improving just one. My $0.02 :))
      See More
      August 21, 2008 at 10:10am
    • Yingkuan Liu
      Mustafa,
      Since Jason mentioned the replication lag can spike up to 20 seconds, so I assume the 20 second rule is based on that.

      Of course there's other method used to counter the replication lag, for example youtube use so called oracle ca...ching algorithm, they pre-fetch the blocks into cache by reading upcoming replication streams from rely-log and conert updates/deletes into selects. Of course it won't help in case of inserts.See More
      August 21, 2008 at 10:17am
    • Charles Nadeau Jason,
      Will you release (open source?) your changes to the source code of MySQL?
      August 21, 2008 at 10:30am
    • Vince Gatto
      Obviously I don't know the technical details of the problem you're trying to solve, but instead of modifying MySQL to parse a new syntax could you have done something like:

      REPLACE INTO profile (`first_name`) VALUES ('Monkey') WHERE `user_...id`='jsobel' /*!10000000 MEMCACHE_DIRTY 'jsobel:first_name'*/

      This is MySQL's standard compatibility comment structure, which in this example says that versions of MySQL less than 1000 should ignore the statement. So while MySQL won't try to process this extra cache flushing information, it will be written to the binary logs and sent over the wire for replication.

      After that, all you need is a process on the slave side which scans the binary logs and flushes your cache.
      See More
      August 21, 2008 at 10:49am
    • Jason Sobel
      Wow, exciting to see so much discussion! I will do my best to address outstanding questions, here goes:

      @Yishan: You're clearly a minion of Godzilla and I can no longer trust you.

      @James: Great observation. At the simplest level, what if ...that cache set happens to hiccup and take longer than the MySQL replication stream? It would overwrite Virginia cache with an old, stale value.
      To the more general question of our delete-and-set cache instead versus a write-through cache, this is an issue we are looking at in the next iteration of our architecture. Interestingly, most of our cache misses come from keys that have aged out, not keys that were deleted because of an update. Thus the value of write-through, while noticeable, is not as significant as you might think. This statistic was a surprise to me!

      @Geoffrey: We don't upgrade our MySQL version very frequently, but when we do we simply merge this change in as a patch. It is surprisingly simple, about 150 lines of code, so it's a pretty painless merge.

      @Domenic: You’re right; it pretty much is a hack! We considered a lot of options (I’ll discuss one of them in my next reply) and this turned out to be the best performance-elegance tradeoff.
      See More
      August 21, 2008 at 12:49pm
    • Jason Sobel
      ‎@Joseph: I’m reading that page as suggesting we do:
      :
      INSERT INTO ‘foo’ (`bar`, `keys_to_dirty`) VALUES (‘val’, ‘foo_bar_key’)

      We didn’t want to add a new column to every table since, on the master, the memcache key list, which can be qui...te long, would take up cache (and disk) space. A similar idea we did consider was adding a table to every database that used the blackhole storage engine. Then, instead of:

      INSERT INTO ‘foo’ (`bar`) VALUES (‘val’) MEMCACHE_DIRTY ‘foo_bar_key’

      we’d do:

      INSERT INTO ‘foo’ (`bar`) VALUES (‘val’) MEMCACHE_DIRTY ‘foo_bar_key’
      INSERT INTO ‘blackhole’ (`keys`) VALUES (‘foo_bar_key’)

      and then have a trigger on the blackhole table to dirty the keys. This is a more elegant solution but increases the number of queries we issue and triggers were not as efficient as compiling in the code directly. The cost wasn’t bad because SELECTs are the majority of our workload (~75%) and don’t dirty any keys, but it still came out to a ~10% performance price that we didn’t want to pay.
      See More
      August 21, 2008 at 12:49pm
    • Jason Sobel
      ‎@Yingkuan: This is actually as close to the exact setup as I can easily describe, but you make great point about various caching policies.

      @Mustafa: We feel like this approach does scale out to other datacenters, with the obvious limitati...on that there is a single master to accept writes. On the upside we can just keep buying servers in that datacenter to keep up with the write traffic. The downside is the extra latency paid by users who are “far” from that datacenter on their write operations.

      The twenty second number is a decent upper bound based on our observation of the replica streams. In the rare cases where the replication latency crosses that boundary, a background monitoring script disables the replica db so that reads do not see stale data. This causes a degrade user experiences as reads to that db either fail or cross the country, but it protects users from inconsistency (which is arguably worse). Another option we haven’t gotten around to testing is programmatically altering the rule in the load balancer to change the threshold based on the current maximum lag, up to some limit.
      See More
      August 21, 2008 at 12:50pm
    • Jason Sobel
      ‎@Charles: These changes are very specific to our environment and focus mainly around the specific memcache client we use in house. That being said I’m happy to send you the diff if you’re interested, send me a private message and we’ll fig...ure it out!

      @Vince: This idea is how we originally formulated the process, actually, and is totally reasonable. My understanding of MySQL replication (and I could be wrong, somebody correct me) is that the relay log on the slave is updated before the MySQL instance actually runs the query. Thus you have a potential race condition where you delete the key before the new value is stored and a subsequent request will trap the old value in cache. There are ways to work around that by using the slave’s replica lag plus some cushion as a delay before dirtying keys, but we liked the hard connection between a query successfully completing and the keys it affects being deleted.
      See More
      August 21, 2008 at 12:50pm
    • George Kong Great post Sobel. So if I understand you correctly, if the West Coast datacenter goes down right now, all of the write pages will be down? Or is there provisions to bring up the VA datacenter as a master if California falls into the Pacific?
      August 21, 2008 at 1:25pm
    • Vince Gatto
      ‎@Jason: Using SHOW SLAVE STATUS's Relay_Log_Pos should get you out of any potential race conditions, but I can understand if the latency of post-processing the binary logs is unacceptable in Facebook's situation.

      All things considered, ev...en this glimpse of what I'm sure is a much more complicated solution is much appreciated. Is this the first modification to MySQL Facebook has made or is this only one of several?See More
      August 21, 2008 at 1:47pm
    • Yingkuan Liu ‎>>@Vince: This idea is how we originally formulated the process...

      Jason,

      I think Vince had suggested to use Bin log on slave side instead of relay-log. Of course most likely you will had log_slave_updates to OFF. It might be a good idea to turn on binlog on one of the slaves and update cache based on the binlog of it. With binlog turned on this slave should be one with most lag.
      August 21, 2008 at 1:51pm
    • Jason Sobel
      ‎@Kong: Correct, if west coast is down the site is in very bad shape. We're working on a DR procedure where we "upgrade" Virginia to the master but it's a tricky process because of all the one-off services we have running in California that... need to be properly replicated.

      @Vnice: Definitely, that's what I meant by "...using the slave's replica lag plus some cushion..." As I said I think it's absolutely a reasonable approach but building the logic directly in to MySQL just "felt" better, especially given how small the code change is. We don't have to worry about the health of the scraper service; as long as MySQL is running, keys are getting dirtied.
      See More
      August 21, 2008 at 2:15pm
    • Joseph Hsieh ‎@Vince - the binary logs are probably distributed, causes more time to elapse, and have other interdependency problems. I think they want to set memcache dirty bit as close to the txn as possible.
      August 21, 2008 at 2:37pm
    • Joseph Hsieh Oops - didn't referesh page. Saw the other responses now...
      August 21, 2008 at 2:39pm
    • Michael Chermside
      I agree with Sathia:

      Why send all traffic to CA for 20 seconds after each client performs an edit? If the only goal is for that client to see the changes that they made, then just do it clilent-side.

      One good answer would be that the chan...ge may affect things that aren't obvious and that the client-side code can't figure out. And perhaps you just don't have a problem with using the CA site (after all, until recently it ran 100% of traffic, so sending some extra portion of the traffic that way should be acceptable). Or it could be that this caching info is, by design, in an infrastructure layer and the application code (including the client) is intentionally kept completely unaware of it.

      So there are lots of possible reasons for not using the client-side approach... what were your ACTUAL reasons?
      See More
      August 21, 2008 at 7:42pm
    • Jason Sobel
      ‎@Michael: We didn't consider this idea actually so I'm making the response up on the fly. Hopefully it makes some sense.

      I'm reading the idea as storing, in a cookie, all the modifications made by a page. Then we'd use the data in the coo...kie until some condition is met (timer expires, cookies matches the data in the replica, etc.).

      My initial reaction is that I worry about that "some condition." If you blindly wait 20 seconds you'll get stale data because in that window multiple updates could have come in from other users (there's an open question on how much staleness of data our users can tolerate). If you try to compare to data in the replica db, what are you looking for? A version number might work and seems the safest.

      Secondarily, since data modifications can be large, this will balloon the size of our HTTP requests and possibly require multiple packets to send one request. If we have to send multiple packets to express the whole HTTP request it could actually end up costing more in aggregate than the one time cost of going to California!

      Finally, as you mention, we don't get a capacity savings since this request had to be served somewhere. There's no "extra" hardware involved, just the California vs. Virginia latency. Also, from our studies, most users look at 2-3 pages a minute, so the 20 second number works out pretty nicely assuming a reasonable distribution of page views.
      See More
      August 22, 2008 at 10:07am
    • Matthieu Aubry thanks for great article! I love it especially because its so rare that huge companies relying on php/mysql share their insights :-)
      August 22, 2008 at 12:01pm
    • Jason Sobel
      ‎@Sathia: This won't work, what if you're updating an object shared with other users? As I said in my original reply, I think version numbers would work. Very neat idea. My concern around packets/request holds, and I'd be a bit nervous abou...t browser limits on cookies, and as I said I'm not sure the benefit is that huge given our workload (most pages don't do writes, and 20 seconds works out well).

      Still, a really clever idea!
      See More
      August 22, 2008 at 12:19pm
    • Mustafa K. Isik ‎@Jason:

      Thanks for the detailed answer/reply to my comment.
      August 24, 2008 at 1:33pm
    • Michael Whalen Wow reading about this stuff makes me wet, awesome guys. So cool that you share the inner workings of Facebook with all of us geeks.
      August 25, 2008 at 1:46am
    • Jawad Shuaib very interesting! thank you
      August 25, 2008 at 2:33pm
    • Will Dowling
      My initial reaction was also why not to use MySQL triggers, though I can definately appreciate that multiple memcache objects might be invalidated.

      The only other solution that comes to mind is to write to an in-memory table the keys you w...ant to invalidate, but this only works in practice if you can make sure the second INSERT is executed within the required time after the original query - and in fact having to do two inserts (even though one is to an in-memory table) is probably much slower than your solution.

      Anyway, thanks for the great post Jason - I've only recently stumbled on the Engineering@FB page and will be reading avidly :)
      See More
      August 25, 2008 at 11:55pm
    • Adam Gries thanks!!
      August 25, 2008 at 11:56pm
    • Fernando Padilla So.
      Any discussion on the load balancer/DNS setup?
      How do you assign users to each datacenter?
      August 29, 2008 at 11:21pm
    • Jia Liu thanks for the nice article, hoping to read more about the insider stories.
      October 5, 2008 at 11:29am
    • Artem Russakovskii Great article. I wish everything was so easy - you're assuming your slave lag is at most 20 seconds. If you can keep it there consistently - congratulations, you have designed an almost perfect app.

      For people with more serious replication lag problems, I wrote this article: MySQL Slave Delay Explained And 7 Ways To Battle It (http://beerpla.net/2008/09/05/mysql-slave-delay-explained-and-7-ways-to-battle-it)
      October 13, 2008 at 11:25pm
    • Kwame Thomison awesome stuff
      October 14, 2008 at 9:15pm
    • Anirudh Badam Looks like this is some sort of eventual consistency for profile updates or name changes etc.......but do you have a mechanism to ensure strong consistency or there is nothing on the main facebook (not apps) pages that needs strong consistency as such?
      October 17, 2008 at 12:26pm
    • Chen Maosheng After reading this awesome stuff and nice replies, I've very interest in memcached which seems better than my previous DB caching. Before using it, I have one question, I know memcache can handle web service efficiently, is it also suitable for MMO games?
      November 30, 2008 at 9:57pm
    • Greg Kostello Great article. Extremely clear with some very real, very critical issues raised and addressed. So, as a follow-up, now that facebook is in multiple countries and regions, how have you needed to modify this process at all to handle N data centers.
      January 13, 2009 at 4:07pm
    • Alice Cheng What a great article. Thank you for sharing!!!
      January 14, 2009 at 6:14pm
    • David Hazelkoff f
      February 25, 2009 at 8:29am
    • Jonathan Ozeran This is a fantastic article - thanks for the transparency and your unique perspective on the challenge!
      March 7, 2009 at 12:41pm
    • Matthew Sinclair nice
      but need more data centers!!!!!!
      April 11, 2009 at 2:47pm
    • Shu Chen Good solution, thanks for sharing!
      December 28, 2009 at 8:44am
    • Nilesh Gamit Nice article...
      January 12, 2010 at 6:19am
    • Ivan Kurnosov wonderful article. and nothing general still changed after 1.5yrs pasts?
      January 17, 2010 at 10:45pm
    • Dan Horne
      Here's an issue. I add a comment to a friend's status. It shows up in the news feed, where I initially saw the status entry. However, it doesn't show up on her profile page.

      I assumed that she had deleted it, but FB still showed it on the ...news feed for some reason. It turned out that after many minutes (I'm not sure of the exact amount, but it would be over 5), the comment appeared in her profile.

      So it seems to me after reading this article, that there could be some delay between the replication from the writing db to the reading db, and that the memcache version of the profile isn't being invalidated on all relevant updates.
      See More
      January 21, 2010 at 8:33am
    • Akande Olajide Kayode
      Hello Facebook Great Team,
      I really like what you are doing infact your operation make the world more smaller and you have construct a bridge that link everyone and to make far away relatives, friends and family feel more closer. You doing ...good well done job here.
      One thing that I will like you to add again on this newpage is the Recently added and the Alpherbetical order name of friends that you have removed, please if you can add it again on the new page it will be complete.
      Thanks
      See More
      February 9, 2010 at 4:27am
    • Bin Hu very infomative, thanks.
      July 7, 2010 at 6:21am
    • Narendra Singh Solanki good article and solution for social network solution but it would be a challenge for low latency data solution for financial and other data critical application.
      November 3, 2010 at 4:17pm
    • Che YongGang Hi Jason, since Facebook setup MySQL Replication across the WAN, do you use some specific technology to make sure it works well, such as VPN?
      January 12 at 8:28pm