Facebook Engineers Crashing Data Centers In Real-World Stress Test

Jay Parikh, Facebook vice president of engineering, says, “It’s easier to take a data center down than to put it back together.” Thanks to regular takedowns of Facebook’s data centers, intended to stress test the company’s disaster recovery efforts, the company’s software engineers are getting better at the putting-it-back-together part.

Parikh explained the effort, dubbed “Project Storm,” to an audience of invited engineers at the third annual @Scale conference, held in San Jose this week. @Scale brings together engineers who build or maintain systems designed for huge numbers of users, from companies including Google, Netflix, Airbnb, Spotify, Dropbox, and others.

The wake-up call for Facebook about how a natural disaster could cause a shutdown was the mammoth 2012 hurricane, Superstorm Sandy, which pushed its way up the East Coast from the Caribbean and made landfall near New York City, wreaking havoc on East Coast internet infrastructure. The superstorm endangered two of Facebook’s data centers, each carrying tens of terabits of traffic. While many companies were disrupted for days or weeks, Parikh said both Facebook centers got through Sandy unscathed, but barely. He said the experience made the company realize it might not be as fortunate the next time.

“We had built up enough redundancy over the years that we weathered the storm, unintended,” he said. “We really came pretty close for us. While we got through this and we didn’t see any major disruptions in our service, we asked ourselves: what would happen if we lost a data center region or a data center due to something like this storm?”

The scale at which Facebook operates apparently compounds its resiliency challenges.

“We all care about scale,” he told the @Scale audience, “we’re obsessed with running things at high volume – lots of customers, lots of people dependent on our applications and services. And we’re solving unprecedented problems, these are problems that are not being solved anywhere in the industry, generally speaking. So…every day is chock full of lot of scalability problems.”

“Instead of just kind of wondering and assuming we’d probably be OK,” Parikh said, the company in 2014 created a SWAT team called Project Storm, comprising the leaders of Facebook’s various technology groups, who in turn marshalled the entire engineering workforce to figure out the answers.

The group developed tools and checklists of tasks, both manual and automated, and set time standards for completing each task. The aim, Parikh said, was “to run like a pit stop at a race; to get everything fixed on the car in the shortest period of time, realizing, however, that this is like taking apart an aircraft carrier and putting it back together in a few hours, not just taking apart a toy that I got for Christmas.”

The SWAT team started with a series of mini shutdown drills and began developing a preliminary emergency system before finally deciding to “pull the plug and see what happened.”

“To be honest things didn’t go all that well the first few times we did this,” Parikh said. “But because we built a lot of instrumentation tooling and preparedness ahead of time, (end users) in the community didn’t notice what happened. We learned a lot and this was exactly the goal of the drill. We wanted to force ourselves to look at what would work and what didn’t work with this massive type of drill that we did.”

The major lesson learned: traffic management load balancing is really hard. During the initial drills, “all hell broke loose,” Parikh said, when the team began the drain of a large set of software systems.

The group began running a number of tests and fine-tuning mechanisms for shifting traffic should a data center drop from the network, Parikh reported.
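The article does not describe how Facebook's traffic-shifting mechanisms actually work. As a rough illustration of the general idea only, the sketch below drains one data center and spreads its traffic across the remaining sites in proportion to each site's spare capacity. The function name, the site names, and all the numbers are hypothetical and were chosen only for this example.

```python
def drain(load, capacity, drained):
    """Return per-site load after moving `drained`'s traffic to other sites.

    Hypothetical policy: each remaining site absorbs a share of the moved
    traffic proportional to its spare capacity (capacity minus current load).
    """
    if drained not in load:
        raise KeyError(drained)

    moving = load[drained]
    # Spare room at every site except the one being drained.
    spare = {site: max(0.0, capacity[site] - load[site])
             for site in load if site != drained}
    total_spare = sum(spare.values())

    if moving > total_spare:
        raise ValueError("not enough spare capacity to absorb the drained site")

    new_load = {drained: 0.0}
    for site, room in spare.items():
        share = moving * room / total_spare if total_spare > 0 else 0.0
        new_load[site] = load[site] + share
    return new_load


# Made-up loads and capacities, in arbitrary traffic units.
load = {"east": 40.0, "west": 30.0, "central": 20.0}
capacity = {"east": 60.0, "west": 60.0, "central": 60.0}

after = drain(load, capacity, "east")
for site, value in sorted(after.items()):
    print(f"{site}: {value:.1f}")
```

A real system would shift traffic gradually and watch the results at each step; this sketch only shows the bookkeeping, namely that total traffic is conserved and no remaining site is pushed past its capacity.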

In 2014, Parikh decided Project Storm was ready for a real-world test: during a normal working day, the team would take down an actual data center and see if they could orchestrate the traffic shift smoothly.

Parikh recalls that other Facebook leaders didn’t think he would actually do it. “I was having coffee with a colleague just before the first drill. He said, ‘You’re not going to go through with it; you’ve done all the prep work, so you’re done, right?’ I told him, ‘There’s only one way to find out’” if it works.

That first takedown, which involved almost the entire engineering team and a lot of people from the rest of the company, turned out to be a little messy, at least from the inside. Users, however, didn’t appear to notice. Parikh presented a chart tracking the traffic loads on various software systems, something that should have displayed smooth curves.

“This is an awful user experience when this happens and you don’t have a good control system,” he said in reference to the chart. “If you’re an engineer and you see a graph like this, three things come to mind. One: either you have bad data and you should go fix it; two: you’ve a control system that’s not working and you should go fix it; or three: you have no idea what you’re doing, and you probably should go fix that.”

“This is what we got – much, much better,” Parikh said of a second graph. “You can see drain on the left side and all that traffic is automatically absorbed, it kind of goes on the other services, the other capacity picks it up, you see there’s low variance here, it kind of looks pretty boring, it looks pretty nice. So we strive to make our graphs for our drill exercises to look like this graph.”

The live takedowns continue today, with Project Storm team members coming up with sillier and wilder goals for just what to take offline, Parikh says. “You need to push yourself to an uncomfortable place to get better.”

Source: IEEE

Kavita Iyer
https://www.techworm.net