Facebook Engineering's Notes
One of the statistics we are most proud of at Facebook is our ratio of users to engineers. When I joined the company in January of 2006, we had 5 million users being supported by about 15 engineers, a ratio of about 300,000 users per engineer. We have more than doubled the size of our engineering team every year since then, but our user growth has far outpaced us. Today there are roughly 1.2 million users per engineer.
In the summer of 2008, as the engineering team was poised to pass Dunbar's number, we decided to try something new to help us scale. Every new engineer who joined Facebook, whether a recent college grad or a new director, would go through an intensive six-week program designed to immerse the new engineer in our code base, give greater flexibility in choosing a project, and promote the types of habits that would allow us to scale up our organization. That program is called Bootcamp.
The primary goal of Bootcamp is to get people up to speed on all parts of our code base while promoting good habits that we believe will pay dividends in the long term, such as fearlessly fixing bugs as we come across them rather than leaving them for future engineers. We have high expectations for our engineers, and part of Bootcamp is making sure those expectations are met. A small number of rotating senior engineers serve as mentors and meet with the new engineers regularly to coach them on how to be more effective at Facebook. The mentors review all the bootcampers' code and even hold office hours to answer any basic questions that engineers might otherwise be too timid to ask. Senior engineers from across the engineering team also give a bunch of tech talks on a broad range of the technologies we use, from MySQL and Memcache to CSS and JavaScript. Even with all this support, most bootcamp graduates agree that the most valuable part of bootcamp is the tasks they are assigned. Engineers have real work assigned to them the first time they open their laptops, and many push code to the live site within their first week. Whether it is fixing bugs from the live site, building internal tools, or making improvements to our infrastructure, graduates find that there is simply no better way to learn than by diving into the code.
Bootcamp also helps educate engineers about the many opportunities at Facebook, ensuring that they wind up on the teams and projects that they are most passionate about and where they feel they can make the biggest impact. Instead of assigning engineers to teams arbitrarily based on a small amount of interaction during interviews, bootcampers choose the team they will join at the end of their six weeks. This gives them an opportunity to meet with the various teams and even fix some bugs in the different code bases before committing to join a team. They also have access to Facebook's strategic priorities so they know where they will be able to have the largest impact and can weigh that against their interests. We believe engineers are at their most productive when they work on things they are passionate about. Matching engineers with the teams that they are excited to join and where they can have a big impact is one way of achieving that goal.
As the program has developed, we've noticed a number of additional benefits. One of the most obvious perks is that we now have a pretty large workforce of highly motivated and talented engineers working on bugs that might not otherwise get engineering attention. In addition, by centralizing the mentoring and onboarding responsibilities, we've greatly decreased the costs hiring imposes on the rest of the organization in terms of time spent showing people the ropes and keeping our standards consistent, which allows us to take our rapid organizational growth in stride. The Bootcamp mentor program is also a great opportunity for developing leadership internally, essentially serving as a meta-bootcamp for potential managers and technical leaders. Perhaps one of the most surprising and positive results has been that bootcampers tend to form bonds with the classmates who joined around the same time, and those bonds persist even after each has joined a different team, fostering cross-team communication and preventing the silos that so commonly spring up in growing engineering organizations. Finally, bootcamp gives us a unique opportunity to take our experiences working with new engineers and use them to fine-tune our interview process in a very tight loop, figuring out what to look out for in resumes, references, and interview feedback.
The Bootcamp program, like most things at Facebook, is constantly evolving to better fit our needs. Some of the improvements are just a matter of incorporating feedback we get from the many talented engineers in Bootcamp at any given time. Perhaps more revealing about the program, however, is the fact that many former graduates of the program have returned as mentors to help improve the program that helped them when they first joined. Together, we work hard to make sure that every engineer has all the tools, knowledge and support to be able to hit the ground running and make changes that positively impact hundreds of millions of users, whether it is your 1st week or your 201st like me.
Andrew Bosworth is the Bootcamp drill sergeant.
At Facebook, we're always looking for ways to make sharing more efficient. Today we're announcing a significant upgrade to our Photos product: a new and improved photo uploader that’s available for testing as a Facebook Prototype.
Since Photos launched in 2005, the photo-uploading experience on Facebook has relied on the use of a third-party ActiveX control (and its sister Java applet).
Over the years we have seen a growing number of complaints with this old uploader. In a recent poll, we discovered a significant percentage of users were unable to upload photos due to technical issues. Many more found it functional, but only just. That's when we resolved to build a modern replacement.
We had the following goals for the new uploader:
- Don't depend on Java.
- Uploads should be asynchronous; that is, you should be able to browse around on Facebook while an upload is ongoing.
- The uploader's UI needs to integrate well with current and future revisions of Facebook's chrome.
- UI iteration needs to be easy (no recompiling code).
- Updates and deployment of any binaries should be as secure, seamless and user-friendly as possible.
With these goals in mind we considered some of the existing options out there -- Adobe Flash, Google Gears, Yahoo! Browser Plus -- but none of them carried the specific functionality we needed. This brought on an interesting challenge: We'd build a headless browser plug-in that could securely provide powerful JavaScript APIs -- like filesystem browsing, background uploading threads, and thumbnailing -- to our front-end code. With this system we are able to build the uploader's UI entirely in HTML and CSS, where iteration is cheap and easy.
How it Works
When you use the new photo uploader for the first time, you'll be asked to install the Facebook Plug-In. This installation flow has been engineered to be as seamless as possible. If you have the Java runtime installed, we use a small applet to download the installer and run it in the background. If not, we provide a direct download link to an installer for your platform. In either case, no browser restart is required to continue.
Once you've installed the plug-in, you can enjoy the new photo uploader. As soon as it's opened, the uploader immediately takes advantage of the new custom APIs we've built: There is a simple photo browser built entirely in HTML and CSS right there in the Facebook frame. [While it looks like magic, it's really just a bunch of cool hacks.] The photos in thumbnail view are served by a light-weight local web server thread, while the filesystem information is provided through a JavaScript API.
After selecting photos and starting an upload, you'll discover another great feature--asynchronous uploading. The plug-in spins off a background thread to perform the upload regardless of what the browser is doing in the foreground. Then the local web server provides JSONP endpoints to retrieve progress information or cancel the ongoing upload, without needing to re-embed the plug-in itself. With this approach you can even navigate away from Facebook entirely without worrying about your uploads.
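As a rough illustration of the JSONP progress endpoint described above, the sketch below wraps a progress payload in a caller-supplied callback. The names `UploadProgress` and `jsonp_body`, and the payload fields, are hypothetical; the plug-in's actual endpoints and response format are not described in this post.

```python
import json
import re
from dataclasses import dataclass

# Only simple JavaScript identifiers are accepted as callback names, so the
# endpoint cannot be tricked into emitting arbitrary script.
_CALLBACK_RE = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*")


@dataclass
class UploadProgress:
    """Hypothetical progress record kept by the background upload thread."""
    uploaded: int
    total: int
    cancelled: bool = False


def jsonp_body(callback: str, progress: UploadProgress) -> str:
    """Return a JSONP response body of the form callback(<json>);"""
    if not _CALLBACK_RE.fullmatch(callback):
        raise ValueError("invalid JSONP callback name")
    payload = {
        "uploaded": progress.uploaded,
        "total": progress.total,
        "cancelled": progress.cancelled,
    }
    return f"{callback}({json.dumps(payload)});"
```

A page would load such a response through a script tag pointing at the local server, which is why no plug-in embed is needed just to poll progress.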
Security
Security is our top concern with this project. Part of the reason we're making this feature available early through prototypes is to solicit your feedback. We've spent long hours architecting a secure experience. Here are some of the key points:
- The local web server that serves thumbnails and other special API endpoints only runs bound to 127.0.0.1, and its secure URIs are protected with a hashing mechanism.
- The plug-in will refuse to run on non-Facebook domain names.
- In the event of an XSS hole on Facebook or a network hijacker, our plug-in has strong mechanisms to prevent unauthorized access to trusted functionality. (To learn more, please see this discussion topic.)
- All code downloads are securely signed and verified, including the entire install flow and any future updates.
- In the unlikely event of a security hole in the plug-in itself, Facebook can easily deactivate it remotely using a "kill switch." This is achieved every time the plug-in starts up by connecting securely to Facebook servers and comparing a minimum-version number.
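Two of the points above lend themselves to a short sketch: protecting local URIs with a hash, and the minimum-version "kill switch" check. The post does not say which hashing scheme the plug-in uses; the version below assumes an HMAC-SHA256 tag over the path with a per-session secret, and every name in it (`SESSION_KEY`, `sign_path`, `verify_path`, `is_version_allowed`) is hypothetical.

```python
import hashlib
import hmac
import secrets

# Hypothetical per-session secret, generated when the plug-in starts.
SESSION_KEY = secrets.token_bytes(32)


def sign_path(path: str, key: bytes = SESSION_KEY) -> str:
    """Append an HMAC-SHA256 tag so the local server can reject forged URIs."""
    tag = hmac.new(key, path.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{path}?sig={tag}"


def verify_path(signed: str, key: bytes = SESSION_KEY) -> bool:
    """Recompute the tag and compare it in constant time."""
    path, sep, tag = signed.rpartition("?sig=")
    if not sep:
        return False  # no signature present
    expected = hmac.new(key, path.encode("utf-8"), hashlib.sha256).hexdigest()
    return hmac.compare_digest(tag.encode("utf-8"), expected.encode("utf-8"))


def is_version_allowed(installed: str, minimum: str) -> bool:
    """Kill-switch check: run only if installed >= server-supplied minimum.

    Assumes dotted numeric versions with the same number of components.
    """
    def parse(version: str) -> tuple[int, ...]:
        return tuple(int(part) for part in version.split("."))

    return parse(installed) >= parse(minimum)
```

Raising the minimum version on the server above every released version would, under this model, stop all installed copies from running at their next start-up.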
If you believe you have discovered a security concern with the browser plug-in or any other part of this project, we would greatly value your input. Please send your security reports to fbplugin-feedback@publists
Rollout
Our new photo uploader is currently available as a Facebook Prototype for testing purposes. To try it out, you can visit the Prototypes section of our Application Directory and activate the prototype “New Photo Uploader.” Depending on the results of these tests, we hope to roll it out to all users soon.
As you might be aware, one of Facebook's oldest and greatest traditions is the Hackathon. Every few months, our engineers decide to stay up all night and just build it.
Hackathon is a chance to work on the ideas we've been thinking about for the last couple months, to turn "Wow, wouldn't it be cool if..." into "Hey, I just built this cool thing." The evening typically starts with takeout Chinese food and ends with a dawn breakfast at a local pancake house. In between, Facebook engineers unleash their coding and design talents to build prototypes of projects they've always wanted to build but never had time during their regular hours. Sometimes the best way to find out if an idea is good is to try it out, and that's part of what Hackathon lets us do.
A couple days later, we hold what we call a Prototype Forum and show off the fruits of our labors. Some ideas flop, some make it into Facebook Prototypes, others may start being actively developed into a full-fledged product for global release.
This time, we'd like to share Hackathon with you. Starting at 9pm on the evening of Tuesday, November 3rd, and continuing till dawn the next morning, wherever you work, wherever you are in the world (feel free to offset the hours as you like), we invite you to join in the spirit of Hackathon by building that idea you've been thinking about and putting together a working version to show the world. We want to do this because we feel that innovation is best demonstrated by bringing ideas to life, even if the first version may not be perfect.
If you'd like to join (or even just to show your support), sign up at this Facebook Hackathon Event, and on the night of Hackathon, join us in creating something! A new website, a clever widget, a fun game, a cool data visualization, a new tools library, a compelling design, or even an art project.
Once you're done, show it off to your friends and family, or post links or video demos of what you've made to our Hackathon Page and check out other people's work.
Who knows, maybe you'll find someone whose work kicks off a project you'd like to collaborate on, and it becomes the seed of the next great idea! We hope to be able to do this for all our future Hackathons, so help us spread the word!
Yishan Wong, an engineer at Facebook, looks forward to hacking on Tuesday night.
Our Translations app allows users (translators) to click on a phrase as they browse the site, and see the original native string, vote on translations suggested by their peers or contribute their own. Here at Facebook, we offer an innovative approach to web site internationalization that leverages a unique infrastructure and a dedicated user community to keep our interface up-to-date in translation. Phrases can be translated inline and in bulk mode.
Figure 1. Inline translation of a previously untranslated (red underlined) phrase with a token (in Nepali).
In this process, our users have produced a parallel corpus of more than 4 million phrases in over 90 languages, which is still growing daily. We use {token}s to incorporate dynamic content (including names of users, applications, events, dates and times), so each phrase (template) has its "Named Entities" annotated both in the source and in each translation (there are several translations per language). This is a valuable resource for research in natural language processing (NLP) and linguistics.
On August 7th, I presented some of our ongoing work in "Social (distributed) language modeling, clustering and dialectometry" at a workshop in Singapore. In abstract, "...a scalable implementation of over 250 million individual language models, each capturing a single user’s dialect in a given language (multilingual users have several models). These have a variety of practical applications, ranging from spam detection to speech recognition, and dialectometrical methods on the social graph. Users should be able to view any content in their language, and to browse our site with appropriately translated interface (automatically generated, for locales with little crowd-sourced community effort)."
Translating variable content, represented in tokenized phrases, is problematic because the values (words) need to be inflected in some languages. In our translation architecture, we have several systems in place to facilitate quality and enable correct translations, requiring minimal effort from translators.
- Glossary, to ensure consistent vocabulary for frequently occurring (and often critical) terms.
- Dynamic Explosion, to separate translation of a single phrase into variants that depend on features of token value.
- Linguistic Rules, to properly handle inflection of variable text (especially difficult for personal names).
Each of these incorporates NLP technology, and is supported by functional and unit testing.
Glossary
Before inline translation, our translator community for a new language must translate and vote on glossary terms. In the main phase of translation, a string might include several glossary terms (e.g., “Connect with your friends by commenting on their actions in News Feed.”, where both “friends” and “News Feed” are glossary terms), each of whose translations should be used in translation of the containing phrase.
When a translation is submitted (via the inline dialog in Figure 1), before accepting it, we check for the use of glossary terms. This could be accomplished with a simple string search, but the term might appear inflected in the target language. There are several possible approaches:
1. Lemmatize both translation and glossary term; then check (string match) for glossary term stem.
2. Exhaustively inflect glossary term; search for each alternative.
3. Apply phonological rules to glossary term stem, enumerating effects of possible distinguishing contexts (e.g. in Finnish: open or closed syllable with front or back vowel).
This last approach is a more efficient hybrid, although implementing it in a truly cross-lingual fashion requires encoding (or preferably inferring) the rules in a collapsed form that is sensitive to classes of context.
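To make the second approach in the list above concrete (exhaustively inflect the glossary term, then search for each alternative), here is a minimal sketch. It assumes inflection can be approximated by appending suffixes to a fixed stem, which is a simplification; the function names and the small Finnish suffix list in the usage note are illustrative only and do not come from Facebook's system.

```python
def inflect(stem: str, suffixes: list[str]) -> set[str]:
    """Enumerate candidate surface forms: the bare stem plus each suffixed form."""
    return {stem} | {stem + suffix for suffix in suffixes}


def uses_glossary_term(translation: str, stem: str, suffixes: list[str]) -> bool:
    """Return True if any enumerated form of the glossary term occurs as a word."""
    # Strip common punctuation and compare case-insensitively.
    words = {w.strip(".,!?;:\"'()").casefold() for w in translation.split()}
    return any(form.casefold() in words for form in inflect(stem, suffixes))
```

For instance, `uses_glossary_term("Kerro kaverille!", "kaveri", ["t", "n", "lle"])` returns `True`. Forms that alter the stem itself would be missed by plain suffixing, which is the kind of gap the third approach is meant to close.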
Dynamic Explosion
This technique allows us to split strings on language-specific variations based on translator feedback. For example, a Hebrew translator indicates that in the phrase “{name} wrote on your wall”, the verb conjugation depends on the gender of the subject. Translators can then submit (and vote on) translations for each case: where the actor is male, female, or unspecified.
In Arabic, there are different inflections for singular, dual and plural, so in the phrase “{number} hours ago”, the value of the number affects the translation. Translators can easily see and modify each of these translations, and the appropriate variant is shown to Arabic users (in this case, in their newsfeeds). In order to account for dependencies on token value, we associate variant translations with a bitmask. At render time, the particular value is tested across each dimension in a language-specific set, and the variation bitmask selects the appropriate form for the translation.
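The bitmask selection described above might look roughly like the sketch below. The `Feature` flags, `number_feature`, and `select_variant` are hypothetical names, and the three-way number split is a simplification of real Arabic plural rules; the actual dimensions and encoding used at Facebook are not given in this post.

```python
from enum import IntFlag


class Feature(IntFlag):
    """One bit per value along each language-specific dimension."""
    MALE = 1
    FEMALE = 2
    UNKNOWN_GENDER = 4
    SINGULAR = 8
    DUAL = 16
    PLURAL = 32


def number_feature(n: int) -> Feature:
    """Map a numeric token value onto the (simplified) number dimension."""
    if n == 1:
        return Feature.SINGULAR
    if n == 2:
        return Feature.DUAL
    return Feature.PLURAL


def select_variant(variants: dict[Feature, str], value_bits: Feature, default: str) -> str:
    """Pick the first variant whose mask covers every bit of the token value."""
    for mask, text in variants.items():
        if (value_bits & mask) == value_bits:
            return text
    return default
```

A variant that does not depend on a given value can simply set several bits in its mask, for example `Feature.FEMALE | Feature.UNKNOWN_GENDER`, so one translation serves both cases.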
Linguistic Rules
Orthographic or phonological rules can affect the spelling of words, and are applied automatically when tokens are substituted with their values. For example, Turkish inflection rules affect any token in possessive, dative or accusative case, such that there are 12 different forms for each. We allow translators to use a proto-form that will be adjusted to match the token when displayed.
Specifically,
“{name1} wrote on {name2}’s wall”
is translated as follows:
“{name1} {name2}’(n)in duvarına yazdı”
If {name2} is “Malmö”, it will be displayed as “...Malmö’nün...”
But if {name2} is “Barış”, it will be displayed as “...Barış’ın...”
Our phonological rules system can import rules encoded in standard rewrite-style (including feature-based) or in two-level formalism, and should ideally also handle optimality theoretic constraints, thus easily drawing on extant literature and other data sources for a wide variety of languages.
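The Malmö and Barış example above can be reproduced with a few lines implementing standard Turkish four-way vowel harmony for the genitive suffix “(n)in”. This is a hand-written sketch for that one suffix only, not the rule-import system described here; the function name `genitive` is hypothetical, and names whose pronunciation departs from their spelling are not handled.

```python
BACK_UNROUNDED = "aıAI"
FRONT_UNROUNDED = "eiEİ"
BACK_ROUNDED = "ouOU"
FRONT_ROUNDED = "öüÖÜ"
VOWELS = BACK_UNROUNDED + FRONT_UNROUNDED + BACK_ROUNDED + FRONT_ROUNDED


def genitive(name: str) -> str:
    """Attach the genitive suffix, harmonizing with the last vowel of the name."""
    last_vowel = next((ch for ch in reversed(name) if ch in VOWELS), None)
    if last_vowel is None:
        return name  # nothing to harmonize with; leave the name unchanged
    if last_vowel in BACK_UNROUNDED:
        suffix_vowel = "ı"
    elif last_vowel in FRONT_UNROUNDED:
        suffix_vowel = "i"
    elif last_vowel in BACK_ROUNDED:
        suffix_vowel = "u"
    else:
        suffix_vowel = "ü"
    # The buffer consonant "n" appears only after a name ending in a vowel.
    buffer = "n" if name[-1] in VOWELS else ""
    return f"{name}’{buffer}{suffix_vowel}n"
```

With this, `genitive("Malmö")` gives “Malmö’nün” and `genitive("Barış")` gives “Barış’ın”, matching the two cases shown above.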
Future Work
Social natural language processing is (in a sense) in its infancy. We hope to capture aspects of its evolution, just as the field comes to better describe and understand ongoing changes in human languages. We expect more fine-grained analyses to follow, using our framework to compare and contrast a variety of languages (from Bantu to Balinese) and phenomena (inside jokes, cross-linguistic usage of l33t and txt msg terms).
Acknowledgments
David Ellis has drawn on his background in computational linguistics (research in academia and industry) to help build a community-driven translation tool. This technology continues to be developed with support from the i18n team (engineers, language managers, interns and others) at Facebook, and all our international users. If you are passionate about translation, machine learning, or large-scale modeling of dynamic systems on the social graph, try a puzzle or two, and join the geekery.
Facebook's News Feed has become increasingly feature rich over time, allowing users to interact with their friends' stories in new and interesting ways. The introduction of the Like feature, launched in February, has been a huge success with an average of 0.3 Likes per user on the average day. The comment and Like box appearing below most feed stories is known as the "Universal Feedback Interface," or UFI to Facebook engineers, and is an important aspect of communication among Facebook users.
These new methods of communication do come at a cost, though. In July, Lior Abraham, an engineer on the Feed team, presented the results of a study he had done on the markup size of the home page. He discovered that 34% of the markup on the home page was due to the UFI. Since the UFI appears on nearly every page on the site, reducing the markup on this feature would lead to page size wins across the board. In particular, the Like feature was selected as an aspect of the UFI that could be slimmed down.
The desired behavior of the Like feature is fairly simple: when a user clicks Like, the Like sentence above the comment box should update to include the user, and the Like link should change to an Unlike link. The previous implementation of Like accomplished this by sending two complete sets of markup to the user. One set included the current user in the list of people who like the item and had an Unlike link, and one excluded the user from that list and had a Like link. CSS rules then ensured the user always saw the applicable sentence and link. When the user clicked Like, a JavaScript function switched between the two links and sentences. This approach made the Like feature feel quick to the user. Because all the markup was already loaded, there was no delay between when the user clicked Like and when the link and sentence updated.
The issue with this approach was its markup cost. Every time a user did not click Like on a given story, the hidden sentence and link were wasted. Those bytes could have never been sent, and the user would not have noticed. While the Like feature is well-used, a slight change in user experience would be well worth it if the size of home.php could be significantly reduced by omitting the second set of markup.
To do so, a small modification was made to the Like functionality. In the modified implementation, when the user clicks Like, the page makes an asynchronous call to Facebook's servers, which returns the new markup for display. Once that asynchronous call returns, the new markup is inserted into the page in the exact same spot as in the prior implementation. Because the markup structure remains the same as before, nearly all the JavaScript behind the Like functionality could remain unchanged.
With this new implementation, the behavior of the Like link stayed mostly unchanged, but there was a slight issue. The Like link would only change to Unlike after the asynchronous call returned, since the second link was omitted from the initial page. This meant that the user had no visual feedback that the Like action had been triggered until after an asynchronous call had been made, which significantly degraded the user's experience. Compared to the sentence, though, the link was a very small piece of markup, and it costs very little to send both links. Hence, we went back to sending both links to the user on initial page load, while waiting until the user clicked to send the new sentence. This led to a slight decrease in markup savings, but made the user experience with the new system nearly indistinguishable from the old system.
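The hybrid described above (both links sent up front, the updated sentence fetched only on click) can be sketched from the server's side as follows. This is an illustration under assumptions: the function names, CSS class names, and sentence wording are all hypothetical and are not Facebook's actual markup.

```python
from html import escape


def like_sentence(names: list[str]) -> str:
    """Build the sentence shown above the comment box."""
    if not names:
        return ""
    if len(names) == 1:
        return f"{names[0]} likes this."
    return ", ".join(names[:-1]) + f" and {names[-1]} like this."


def render_initial_ufi(likers: list[str], viewer_likes: bool) -> str:
    """Initial page load: send both links (cheap) but only one sentence."""
    like_hidden = " hidden" if viewer_likes else ""
    unlike_hidden = "" if viewer_likes else " hidden"
    return (
        f'<a class="like{like_hidden}">Like</a>'
        f'<a class="unlike{unlike_hidden}">Unlike</a>'
        f'<div class="like_sentence">{escape(like_sentence(likers))}</div>'
    )


def handle_like_request(likers: list[str], viewer: str) -> str:
    """Asynchronous call: return only the updated sentence markup."""
    updated = likers if viewer in likers else likers + [viewer]
    return f'<div class="like_sentence">{escape(like_sentence(updated))}</div>'
```

The client can swap the visible link immediately on click, then replace the sentence element when the response from `handle_like_request` arrives.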
While seemingly a minor change, the savings from this proved to be quite substantial. A UFI with no likes or comments used to be 1.7 kilobytes, despite not even being displayed on the page unless the user clicked Like. This change reduced the size of that empty UFI to under 100 bytes. A UFI with just a like sentence was also around 1.7 kilobytes, which was reduced to around 900 bytes with this change.
In early August, around three weeks after the presentation on markup size, the updated Like markup was pushed to the live web servers. At that time, the average size of home.php was around 250 kilobytes. This change decreased the average size of that page by 20 kilobytes. Without degrading the user experience, the homepage of Facebook had become around 8% slimmer. Finding small changes like this to make the site faster is one way that we can help improve Facebook for our users.
Dan Schafer, a summer engineering intern, just started his senior year at Carnegie Mellon University, and hopes this page loaded just a little bit faster for everyone.
Site speed has always been an important factor in the development of Facebook, even as the site evolves over time to become more feature-rich and complex. As we grow beyond the 250 million user mark, every small change to the site causes a huge ripple, affecting throngs of web surfers and their experience on Facebook. My project this summer as an engineering intern on the Infrastructure team involved tackling this imposing fact by exploring data and finding out how various changes to fundamental parts of the user experience impacted and changed user behavior.
Over this summer, I conducted a variety of experiments on sample tests of users, tweaking subtle parts of the site for each experimental group to investigate some of the questions often posed within Facebook as theories and hypotheses. For example, what exactly is the user behavior effect of a slower site or of one perceived to be faster? How do usage and interaction patterns change when these factors change? What about the number of stories we show in News Feed -- what's the performance vs. usage exchange there? How about the appearance of a page while it's loading? How does a user who casually uses Facebook become affected by certain site changes compared to a user who uses Facebook many times throughout their day? These questions and more are what I sought to answer through experimentation and data.
Here are a few findings from my research:
Site Speed
What happens to user behavior when we tweak the site to be slower in various degrees for them? It turns out that over a large gradient of site slowdowns, users in general spend around the same amount of time on Facebook, as measured by session time (user activity up until a certain period of idleness). Logically, page views suffer as a result. If pages take longer to load but people still spend the same amount of time on Facebook, then the number of page views is inversely proportional to the page loading time. On the other side of the coin, it means that since people tend to spend the same amount of time on Facebook, improving site speed would allow them to explore much more content and discover more of the network around them each visit.
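The inverse relationship can be checked with simple arithmetic. The sketch below assumes, as the paragraph does, that session time stays fixed and that the time spent per page is dominated by load time; the function name and the numbers in the usage note are made up for illustration.

```python
def page_views(session_seconds: float, seconds_per_page: float) -> float:
    """Page views in one session when total session time is held constant."""
    if seconds_per_page <= 0:
        raise ValueError("seconds_per_page must be positive")
    return session_seconds / seconds_per_page
```

For a 600-second session, `page_views(600, 2)` is 300.0 while `page_views(600, 4)` is 150.0: doubling the time per page halves the number of pages seen.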
Page Loading
Another debate within Facebook pitted two loading schemes against each other. The debate was over whether or not we render the page as soon as possible. This runs the risk of showing users links and content that they won't get a response from until the interaction scripts load -- an experience some employees claim to be potentially infuriating. An alternative is to show nothing until the JavaScript is primed to load, and then to render everything at once. That way, the page appears to load really quickly, and the content is interactive almost immediately after rendering. I ran tests for groups of users, comparing one scheme against the other with the effect amplified to see the results. In all groups of users, keeping the page blank resulted in lower usage statistics. Thus the debate was resolved.
Scroll Loading
A while back, some clever engineers at Facebook tackled the site speed problem on News Feed in an interesting way. We showed 30 stories in the News Feed, but when users loaded the home page they seldom saw more than 10. Since loading stories was a significant time expense in loading the page, the concept of scroll loading was introduced -- loading 15 stories when the page is loaded and then loading the other 15 (through an Ajax hook) when the user had scrolled down far enough to need 15 more stories. The home.php load time was reduced and the user experience wasn't harmed. It was a good gain for us.
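A minimal model of scroll loading, as described above, is a loader that exposes an initial batch and releases the next batch once the reader nears the end of what is loaded. The class name `ScrollLoader`, the `threshold` parameter, and the trigger condition are assumptions for illustration; the real Ajax hook is not described in this post.

```python
class ScrollLoader:
    """Serve stories in batches: some on page load, more on scroll."""

    def __init__(self, stories, initial: int = 15, batch: int = 15):
        self._stories = list(stories)
        self._batch = batch
        self._shown = min(initial, len(self._stories))

    def visible(self) -> list:
        """Stories currently loaded into the page."""
        return self._stories[: self._shown]

    def on_scroll(self, last_seen_index: int, threshold: int = 3) -> list:
        """Return newly loaded stories, or [] if the reader is not near the end.

        `last_seen_index` is the index of the lowest story on screen.
        """
        remaining = self._shown - (last_seen_index + 1)
        if remaining > threshold:
            return []
        start = self._shown
        self._shown = min(self._shown + self._batch, len(self._stories))
        return self._stories[start : self._shown]
```

Because the batch size is a constructor argument, changing how many stories arrive on scroll is a one-parameter change in this model.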
But this concept had more potential in it. If people didn't mind the time delay as stories loaded when they scrolled down, what if we took this opportunity to show them more stories in the News Feed? One would point to the 'Older Posts' link for that purpose, but the link often went unloved - perhaps because people thought of it as a stopping point or because people simply didn't notice it at all.
I performed an experiment gauging user behavior when we showed 30 more stories once a user scrolled down instead of 15 (bringing the total to 45), and the results were very encouraging: usage statistics rose across the board for many pages. For the casual users who made up the bottom 25 percent of usage statistics, we saw a gain of over 20% in page views and time spent. This was a huge win, especially for the casual users who we have a hard time engaging. Soon after, the scroll-loading paradigm was changed to load 30 stories. It's not often that research bears fruit this quickly.
All of these important conclusions from the data I collected were just a part of the exciting work that I've done this past summer during my internship at Facebook. There are always more theories to confirm, more details to find and more variables to optimize -- and there's nothing better than the quick hard facts to answer the questions that they bring.
Over this summer, I conducted a variety of experiments on sample groups of users, tweaking subtle parts of the site for each experimental group to investigate some of the questions often posed within Facebook as theories and hypotheses. For example, what exactly is the effect on user behavior of a slower site, or of one perceived to be faster? How do usage and interaction patterns change when these factors change? What about the number of stories we show in News Feed -- what's the performance vs. usage trade-off there? How about the appearance of a page while it's loading? How is a user who casually uses Facebook affected by certain site changes compared to a user who uses Facebook many times throughout the day? These questions and more are what I sought to answer through experimentation and data.
Here are a few things I learned during my research:
Site Speed
What happens to user behavior when we slow the site down by varying degrees? It turns out that over a wide range of slowdowns, users in general spend around the same amount of time on Facebook, as measured by session time (user activity up until a certain period of idleness). Logically, page views suffer as a result. If pages take longer to load but people still spend the same amount of time on Facebook, then the number of page views is roughly inversely proportional to the page loading time. On the other side of the coin, since people tend to spend the same amount of time on Facebook, improving site speed would allow them to explore much more content and discover more of the network around them each visit.
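The arithmetic behind that claim can be sketched in a few lines. This is only a back-of-the-envelope model, not the analysis used in the experiment: the function name, the `dwell_seconds` parameter (time spent looking at a page once it has loaded), and the sample numbers are all assumptions made for illustration.

```python
def page_views(session_seconds: float, load_seconds: float,
               dwell_seconds: float = 0.0) -> float:
    """Estimate page views in a session of fixed length.

    Each page view costs the load time plus an optional dwell time.
    With dwell_seconds == 0, views are exactly inversely proportional
    to load time, which is the simplification used in the text.
    """
    return session_seconds / (load_seconds + dwell_seconds)


# Same 10-minute session, page load time doubled: page views halve.
print(page_views(600, 10))  # 60.0
print(page_views(600, 20))  # 30.0
```

Once a nonzero dwell time is included, the relationship is no longer strictly inverse, which is one reason to read the proportionality as approximate.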
Page Loading
Another debate within Facebook pitted two loading schemes against each other. The debate was over whether or not we should render the page as soon as possible. Doing so runs the risk of showing users links and content that won't respond until the interaction scripts load -- an experience some employees claimed could be infuriating. The alternative is to show nothing until the JavaScript is ready to run, and then to render everything at once. That way, the page appears to load really quickly, and the content is interactive almost immediately after rendering. I ran tests on groups of users, comparing one scheme against the other with the effect amplified to make the results visible. In all groups of users, keeping the page blank resulted in lower usage statistics. Thus the debate was resolved in favor of rendering as soon as possible.
Scroll Loading
A while back, some clever engineers at Facebook tackled the site speed problem on News Feed in an interesting way. We showed 30 stories in the News Feed, but when users loaded the home page they seldom saw more than 10. Since loading stories was a significant time expense in loading the page, the concept of scroll loading was introduced -- loading 15 stories when the page is loaded and then loading the other 15 (through an Ajax hook) when the user had scrolled down far enough to need 15 more stories. The home.php load time was reduced and the user experience wasn't harmed. It was a good gain for us.
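The scroll-loading idea can be sketched roughly as follows. This is a simplified model rather than the News Feed code: the `ScrollLoader` class, the `threshold` parameter, and the in-memory story list are hypothetical, and the real second batch was fetched through an Ajax call rather than sliced from a list.

```python
from typing import List

INITIAL_BATCH = 15  # stories rendered with the page, as described above
SCROLL_BATCH = 15   # stories fetched once the user scrolls far enough


class ScrollLoader:
    """Hypothetical stand-in for a feed that loads stories in batches."""

    def __init__(self, stories: List[str], initial: int = INITIAL_BATCH,
                 scroll_batch: int = SCROLL_BATCH) -> None:
        self._stories = stories
        self._scroll_batch = scroll_batch
        self._loaded = min(initial, len(stories))

    def visible(self) -> List[str]:
        """Stories currently rendered on the page."""
        return self._stories[:self._loaded]

    def on_scroll(self, last_seen_index: int, threshold: int = 3) -> List[str]:
        """Return the next batch once the user is within `threshold`
        stories of the end of what is loaded; otherwise return nothing."""
        if last_seen_index < self._loaded - threshold:
            return []
        start = self._loaded
        self._loaded = min(start + self._scroll_batch, len(self._stories))
        return self._stories[start:self._loaded]


feed = [f"story {i}" for i in range(30)]
loader = ScrollLoader(feed)
print(len(loader.visible()))      # 15 stories on initial page load
print(len(loader.on_scroll(5)))   # 0, the user is still near the top
print(len(loader.on_scroll(13)))  # 15, the second batch is fetched
```

Keeping the batch sizes as parameters makes it cheap to experiment with how many stories to load on scroll.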
But this concept had more potential in it. If people didn't mind the time delay as stories loaded when they scrolled down, what if we took this opportunity to show them more stories in the News Feed? One would point to the 'Older Posts' link for that purpose, but the link often went unloved - perhaps because people thought of it as a stopping point or because people simply didn't notice it at all.
I performed an experiment gauging user behavior when we showed 30 more stories once a user scrolled down instead of 15 (bringing the total to 45), and the results were very encouraging: usage statistics rose across the board for many pages. For the casual users who made up the bottom 25 percent of usage statistics, we saw a gain of over 20% in page views and time spent. This was a huge win, especially for the casual users whom we have a hard time engaging. Soon after, the scroll-loading paradigm was changed to load 30 stories. It's not often that research bears fruit this quickly.
All of these important conclusions from the data I collected were just a part of the exciting work that I've done this past summer during my internship at Facebook. There are always more theories to confirm, more details to find and more variables to optimize -- and there's nothing better than the quick hard facts to answer the questions that they bring.
Zizzy is back at school as a junior at Carnegie Mellon University and is already missing his summer.
Facebook engineers are a fiercely competitive group. Competition helps drive many of our core values, including moving fast and making a big impact. It seems natural, then, that we’re involved with TopCoder, a company that runs online programming competitions ranging from hour-long algorithm contests to longer competition-driven application development. TopCoder gives programmers around the world the opportunity to show off their coding kung fu, sometimes for cash and sometimes for bragging rights.
We have a lot of past and current TopCoder competitors among our ranks at Facebook, myself included. The competitors here have a variety of backgrounds: some have spent years before college training specifically for competitive coding and others picked it up fairly late in their formal education. I fall into the latter category, and had my first experience in competitive programming through TopCoder in the middle of my college career. A friend had harassed and shamed me until I finally agreed to compete against him, and I went in cold with essentially no preparation other than my basic programming knowledge. I lost, but it didn’t really matter. It just made me want to get better.
Faced with an objective measure of how I stacked up against the people around me, I started preparing for the contests more seriously. I put in hours each day cranking through practice problems, learning tricks and approaches that I never encountered in school. Long story short, TopCoder went a long way to get me motivated and interested in writing awesome code (assuming my code had been, until that point, sub-awesome). I can’t say I’d have landed a job as a software engineer at Facebook without the learning that the competitions provided.
So, I’m personally excited that Facebook is sponsoring the next TopCoder single-round match (SRM 447). Seasoned veterans and newcomers alike will get to duke it out in real time to see who can solve the toughest problems the fastest. We're planning on distributing cash prizes to top performers, and all competitors will have the chance to opt in to be contacted by Facebook recruiting if they put on an impressive show.
Anyone interested in competing in the Facebook-sponsored match on August 25th (or any other single-round match) should take a look at TopCoder’s docs on how to compete. Their practice problems, accessible from the Arena applet, as well as Facebook Puzzles, are great fodder for getting your brain and your fingers warmed up and ready to compete. Keep an eye out for Facebook engineers in the Arena during the hour before the match; we will be around to chat and answer any questions that come our way. Hope to see you all there, and good luck!
Tim Stanke, an engineer at Facebook, will see you in the Competition Arena.
A few weeks ago we hit a milestone of 50MM usernames, just over a month after we launched usernames on June 12. Ever since the launch, a lot of people have expressed interest in understanding how we designed the system and prepared for this big event. In a recent post, my colleague Tom Cook wrote about the site reliability and infrastructure work that we did to ensure a smooth launch. As an extension to that post, I’ll discuss some specific application and system design issues here.
Launching usernames to allow over 200 million people (at the time; we’re now over 250 million) to get a username at the same time presented some really interesting performance and site reliability challenges. The two main parts of the system that needed to scale were (1) the availability checker and (2) the username assigner. Since we were pre-generating suggestions for users, we needed to check availability of all the suggested names, which placed extra load on the availability checker.
Optimizing read and write performance
It became clear to us that the database tier would not be able to handle the huge initial load for availability checks. Even caching the results of availability check calls would not have helped much since the hit rates would be low. To solve these problems, we created a separate memcache tier to store all assigned usernames. Checking if a username is available is just a quick lookup in this memcache tier. If the lookup returns no result, we assume the name is available. This allowed us to completely eliminate any dependency on the database tier for availability checks. To distribute the availability check load across several memcache nodes, we replicated the cache across several machines in each of our data centers. We allocated about 1TB of memory for the entire username memcache tier. This design meant that we were using memcache as the authoritative data source for checking availability. This was a non-trivial decision to make, and we had to design special fault tolerance mechanisms (described in the next section) to make this reliable.
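A minimal sketch of the availability check, under loud assumptions: each memcache replica is modelled as a plain Python set, the replica is picked at random to spread read load, and the `is_available` name and its signature are hypothetical. The real tier used memcache clients and node selection that the post does not describe.

```python
import random
from typing import List, Optional, Set

# Each replica is modelled as a set of assigned usernames. In production this
# would be a memcache node; the set is only a stand-in (assumption).
Replica = Set[str]


def is_available(name: str, replicas: List[Replica],
                 rng: Optional[random.Random] = None) -> bool:
    """Return True if `name` is absent from one randomly chosen replica."""
    chooser = rng or random
    replica = chooser.choice(replicas)  # spread read load across replicas
    return name not in replica          # a miss is treated as "available"


assigned = {"alice", "bob"}
replicas = [set(assigned) for _ in range(3)]  # every replica holds the full set
print(is_available("alice", replicas))           # False, already assigned
print(is_available("brand.new.name", replicas))  # True, no replica has it
```

Because a miss is read as "available", the answer is only as good as the replica's contents, which is why the fault tolerance mechanisms matter.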
When a username is assigned, the data is written to the database for a reliable, persistent record of the transaction. The username is also added to all the nodes in the replicated memcache tier. Writing the data to multiple memcache nodes implied a slight drop in write performance, but since we expected the read load to be much higher than the write load, this was a good trade-off to make. For detecting conflicts when multiple users try to grab a name at the same time, we used an optimistic concurrency control mechanism. This improved write performance by eliminating the need to hold locks.
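The post does not say how the optimistic concurrency control was implemented. One common lock-free approach, shown here purely as an assumption, is to attempt the insert and let a uniqueness constraint detect the conflict. The sketch uses an in-memory SQLite database and Python sets as stand-ins for the production database and memcache nodes; the table layout and the `assign_username` helper are hypothetical.

```python
import sqlite3
from typing import List, Set


def make_db() -> sqlite3.Connection:
    """Create an in-memory database standing in for the database tier."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE usernames (name TEXT PRIMARY KEY, user_id INTEGER NOT NULL)"
    )
    return db


def assign_username(db: sqlite3.Connection, cache_nodes: List[Set[str]],
                    user_id: int, name: str) -> bool:
    """Try to claim `name` without taking locks up front.

    The PRIMARY KEY constraint detects a conflicting claim at write time.
    On success the name is written to every replicated cache node.
    """
    try:
        with db:  # commits on success, rolls back on error
            db.execute(
                "INSERT INTO usernames (name, user_id) VALUES (?, ?)",
                (name, user_id),
            )
    except sqlite3.IntegrityError:
        return False  # someone else got the name first
    for node in cache_nodes:
        node.add(name)
    return True


db = make_db()
nodes: List[Set[str]] = [set(), set()]
print(assign_username(db, nodes, 1, "alice"))  # True, first claim wins
print(assign_username(db, nodes, 2, "alice"))  # False, conflict detected
```

Writing to every node on each assignment is the write cost the text mentions; reads stay a single lookup.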
We also briefly considered using Bloom filters, but quickly concluded that they weren’t the best solution for our problem because (1) space efficiency was not a primary concern, since we could fit many hundreds of millions of usernames in memory on a single machine; (2) Bloom filters can return false positives (incorrect hits); and (3) removing items from them is not simple.
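To make points (2) and (3) concrete, here is a toy Bloom filter. It is a textbook-style sketch, not something from the username system: the class, its parameters, and the SHA-256-based hashing are choices made for illustration.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Toy Bloom filter: k hash positions per item over a fixed bit array."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, item: str) -> Iterator[int]:
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(item))


taken = BloomFilter(num_bits=1024, num_hashes=3)
taken.add("alice")
print(taken.might_contain("alice"))  # True, added items are never missed
```

A lookup for a name that was never added can still return True when its bit positions happen to be set by other names, which here would mean telling someone an available name is taken. There is also no `remove` method: clearing a bit could erase evidence of other names that share it.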
Fault tolerance
One of the issues with using memcache as the system of record for availability checks is that memcache nodes can go down. While the redundant memcache boxes provided some fault tolerance, we wanted to design a system that would allow us to bring back failed nodes easily. To enable that, we wrote a script that can populate a username memcache node from log files that contain all the assigned usernames. The log files are written to Scribe as part of assigning a username. We also used the Scribe logs to build a data collector that reported the number of assigned usernames in real time, with a latency of just a few seconds.
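The replay idea can be sketched as below. The log format is an assumption (tab-separated timestamp, user id, and username, one assignment per line), as is the `rebuild_node` helper; the post does not show the actual script or the Scribe message layout, and a set again stands in for the memcache node.

```python
from typing import Iterable, Set


def rebuild_node(log_lines: Iterable[str]) -> Set[str]:
    """Rebuild a username cache node from assignment log lines.

    Assumed line format: "<timestamp>\t<user_id>\t<username>".
    Blank or malformed lines are skipped rather than aborting the replay.
    """
    node: Set[str] = set()
    for line in log_lines:
        fields = line.strip().split("\t")
        if len(fields) != 3 or not fields[2]:
            continue
        node.add(fields[2])
    return node


log = [
    "1244865600\t1\talice",
    "1244865601\t2\tbob",
    "",                     # blank line, ignored
    "garbled line",         # malformed, ignored
]
node = rebuild_node(log)
print(sorted(node))  # ['alice', 'bob']
print(len(node))     # 2, the same logs can feed a running count
```

Replaying is idempotent in this model: processing the same line twice leaves the node unchanged, so a partial replay can simply be restarted.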
Another issue with memcache is that writes to memcache are not transactional and hence are not guaranteed to succeed 100% of the time. This means that we might occasionally say that a username is available when it really isn’t. Note that if the user tries to grab such a name, it will fail since the database is the ultimate source of truth. However, since this is not an ideal user experience, we made our system robust through a couple of mechanisms. First, to reduce the probability of incorrect misses, we always check a second memcache node for any miss in the first node. Second, any time that the process of assigning a username fails in the database due to an already assigned name, we will re-populate all memcache nodes with that name so that we can prevent future users from experiencing the same problem.
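Both safeguards can be sketched together. As before, sets stand in for memcache nodes and a dictionary stands in for the database; the function names and the choice of which node counts as "second" are assumptions, not details from the post.

```python
from typing import Dict, List, Set


def is_available(name: str, first: Set[str], second: Set[str]) -> bool:
    """Treat a name as available only if two nodes both miss."""
    if name in first:
        return False
    return name not in second  # double-check a miss against a second node


def claim(name: str, user_id: int, database: Dict[str, int],
          nodes: List[Set[str]]) -> bool:
    """Assign `name` if the database (the source of truth) allows it."""
    if name in database:
        # The caches wrongly reported a miss: repair every node so that
        # later users do not hit the same failure.
        for node in nodes:
            node.add(name)
        return False
    database[name] = user_id
    for node in nodes:
        node.add(name)
    return True


database = {"alice": 1}
nodes: List[Set[str]] = [set(), set()]            # both cache writes were lost
print(is_available("alice", nodes[0], nodes[1]))  # True, an incorrect miss
print(claim("alice", 2, database, nodes))         # False, the database refuses
print(is_available("alice", nodes[0], nodes[1]))  # False, caches repaired
```

The second lookup only costs extra on a miss, so the common case of a hit on an assigned name stays a single read.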
Load Testing
Since it was difficult to get accurate estimates of the number of users that would log in at launch time to get a username, we tested our systems under fairly high load; in fact, we stress tested our system with more than 10x the load we actually saw at launch time. The load testing helped us identify several problems in our infrastructure, including but not limited to (1) improperly configured networks, (2) bottlenecks in our database id generation mechanism (used to generate primary keys for certain objects), and (3) capacity bottlenecks for write traffic originating from our east coast data center.
Contingency Planning
Since we didn’t have accurate estimates of the traffic that the launch would generate, we put together contingency plans to decrease the load on various parts of the site, giving us extra capacity in core services at the expense of less essential ones. Affectionately referred to as “nuclear options”, these levers included disabling chat notifications, showing fewer stories on the home page and profile page, and completely turning off other parts of the site such as the People You May Know service and the entire chat bar.
Our careful design and planning paid off on launch night and later. In the first three minutes over 200,000 people registered names, with over 1 million allocated in the first hour, and over 50MM in just over a month. Through the entire launch we had no issues handling the additional load and none of our "nuclear options" had to be used at any point. Two of our memcache nodes did go down a few weeks after launch, but our Scribe log replay scripts helped us bring them back up again.
Srinivas Narayanan, an engineer at Facebook, is excited about being able to visit most of his friends’ profiles through their usernames.
Today we celebrate the 10th Annual System Administrator Appreciation Day. Sysadmins work throughout Facebook Ops, IT and Engineering 24 hours a day, 7 days a week to keep the critical elements of site services up and running. They make an impact just about everywhere, from internal systems and tools to production applications like Photos, Facebook Connect, and News Feed.
Sysadmins are often the invisible heroes behind a company's success. A salesperson might get a bonus for exceeding sales goals, a software engineer might be featured in a magazine or a newspaper for a breakthrough product, but a system administrator...well, they usually just equate success with not getting paged at 2 in the morning.
At Facebook, a dedicated group of sysadmins have labored tirelessly to scale our website to serve over 250 million users, others have built out the infrastructure that supports our network of employees across the globe, and altogether they've made a substantial contribution to Facebook's mission to give people the power to share and make the world more open and connected.
Please take some time today to thank a sysadmin you know for the work that they do to keep things like your email, fileservers, and favorite websites running at peak performance. If you’re a system administrator yourself and the idea of supporting the infrastructure behind one of the most trafficked sites on the Internet makes your mouth water, be sure to check out our open positions at facebook.com/careers.
To learn more about System Administrator Appreciation Day, visit www.sysadminday.com.
[N.B.: The note below profiles some research that was conducted last year at Facebook based on the old News Feed. The resulting paper was presented at the International AAAI Conference on Weblogs and Social Media (ICWSM) in May 2009, where it received the Best Paper award. The full paper can be found at http://cameronmarlow.com/papers/gesundheit-modeling]
How do certain celebrities, movies, or bands become really popular? It’s a mystery that intrigues everyone from high school students and academic researchers to ad agency executives. Some people, like author Malcolm Gladwell of "The Tipping Point", contend that popularity is driven by a few key influencers who get everyone else to join them. Others, like sociologist Duncan Watts, give less credit to individual trendsetters, arguing that the key lies in vast groups of closely connected people who are easily convinced to try something new.
Facebook is a great place to test these theories. If you have ever become a fan of a Facebook Page that grew to be wildly popular--or that never attracted as big a following as you expected--you've probably wondered how that came to be. Several of us on the Facebook Data Team have been analyzing the ways fan support has mobilized across thousands of Pages, covering topics as diverse as TV shows such as “Battlestar Galactica,” musicians such as Snoop Dogg and even philosophers such as Plato.
After identifying Pages that attracted a lot of fans, we analyzed how each new wave of fans arrived. Thanks to Facebook’s News Feed, people constantly learn what Pages their friends are “fanning,” creating the opportunity to check out those Pages and become a fan, too. For example, I might fan a Page after noticing my fiancée fanned it first; three of my friends might follow suit after seeing that I’ve fanned that Page; and so on, creating a longer chain of connections each time. This sequence of connections is like a domino effect created by News Feed, enabling fan actions to evolve into ultra-long chains.
What’s even more striking is how a flurry of fanning or short chains, all started by many people acting independently, often merges together into one gigantic group of friends and acquaintances. This merging happens when one person fans a Page after seeing two or more friends fan that same Page. A case in point is a Page devoted to a popular European cartoon, Stripy. The diagram below shows the cartoon's close-knit communities of fans in both Bosnia (blue) and Slovenia (yellow). A few fans serve as the “bridge” that brings the two groups together. A third cluster of Croatian fans (green) hasn’t yet found its connecting bridge. Finally, there are a few fans from other countries (grey), perhaps Bosnian and Slovenian expatriates!
In fact, more than 90% of a popular Page’s fans can be part of a single group of people who are all somehow connected to one another. Typically, each of these large, close-knit communities contains thousands of separate starting points--individuals who independently decide to fan a particular Page. No single person is accountable for the popularity of a Page; instead, we consistently see that roughly 15% of all fans arrived independently and started their own chains (which merge together as the rest of the fan base takes shape). These patterns hold for Pages with a few thousand fans and for those with more than 50,000.
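One way to compute the two figures mentioned here, the share of fans who started their own chain and the share of fans in the largest connected group, is with a union-find structure. This is an illustrative sketch, not the method from the paper: the event format (each fan listed once, in time order, with the friends who had already fanned the Page) and the `analyze_fans` name are assumptions.

```python
from collections import Counter
from typing import Dict, List, Tuple


def analyze_fans(events: List[Tuple[str, List[str]]]) -> Tuple[float, float]:
    """Return (share of chain starters, share of fans in the largest cluster).

    `events` lists each fan once, in time order, with the friends who had
    already fanned the Page when that fan joined. A fan with no such
    friends starts a new chain; a fan with two or more merges chains.
    """
    if not events:
        raise ValueError("need at least one fan event")
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    starters = 0
    for fan, earlier_friends in events:
        parent[fan] = fan
        if not earlier_friends:
            starters += 1
        for friend in earlier_friends:
            if friend in parent:  # ignore friends missing from the data
                parent[find(friend)] = find(fan)
    sizes = Counter(find(fan) for fan in parent)
    total = len(parent)
    return starters / total, max(sizes.values()) / total


events = [
    ("a", []),          # starts a chain
    ("b", ["a"]),       # extends a's chain
    ("c", []),          # starts a second chain
    ("d", ["b", "c"]),  # saw two friends: merges both chains
    ("e", []),          # starts a third, still isolated chain
]
print(analyze_fans(events))  # (0.6, 0.8)
```

In this toy data three of five fans arrived independently, yet four of the five end up in one connected group once "d" bridges the first two chains.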
Eventually, we hope to gain an even deeper understanding of the ways that popularity spreads. From our work so far, it appears that the most explosively popular Pages catch on as closely connected groups of like-minded people contact one another. Individual influencers aren’t nearly as crucial to a Page’s success: Pages grow if people are easily engaged by the content, not because of the actions of a couple of trendsetters. That may also be true in other areas--such as the ways that new Platform Applications catch on--or a different dynamic might apply. We’ll be analyzing word-of-mouth (or should we say, word-of-mouse) referrals to find out.