Facebook Engineers Crash Data Centers in Real-World Stress Tests
Jay Parikh, Facebook’s vice president of engineering, says, “It’s easier to take a data center down than to put it back together.” Thanks to regular takedowns of Facebook’s data centers, intended to stress test the company’s disaster recovery efforts, its software engineers are getting better at the putting-it-back-together part.
Parikh described the effort, dubbed “Project Storm,” to an audience of invited engineers at the third annual @Scale conference, held in San Jose this week. @Scale brings together engineers who build or maintain systems designed for huge numbers of users, from companies including Google, Netflix, Airbnb, Spotify, and Dropbox.
Facebook’s wake-up call about how a natural disaster could force a shutdown was Superstorm Sandy, the mammoth 2012 hurricane that pushed its way up the East Coast from the Caribbean and made landfall near New York City, wreaking havoc on East Coast internet infrastructure. The storm threatened two of Facebook’s data centers, each carrying tens of terabits of traffic. While many companies were disrupted for days or weeks, Parikh said both Facebook data centers got through Sandy unscathed, but barely. He said the experience made the company realize it might not be as fortunate the next time.
“We had built up enough redundancy over the years that we weathered the storm, no pun intended,” he said. “It was really pretty close for us. While we got through this and we didn’t see any major disruptions in our service, we asked ourselves: what would happen if we lost a data center region or a data center due to something like this storm?”
The scale at which Facebook operates apparently compounds its resiliency challenges.
“We all care about scale,” he told the @Scale audience. “We’re obsessed with running things at high volume – lots of customers, lots of people dependent on our applications and services. And we’re solving unprecedented problems, these are problems that are not being solved anywhere in the industry, generally speaking. So…every day is chock full of a lot of scalability problems.”
“Instead of just kind of wondering and assuming we’d probably be OK,” Parikh said, the company in 2014 created a SWAT team called Project Storm. It comprised the leaders of Facebook’s various technology groups, who in turn marshalled the entire engineering workforce to figure out the answers.
The group developed tools and checklists of tasks, both manual and automated, and set time standards for completing each task. “We wanted,” Parikh said, “to run like a pit stop at a race; to get everything fixed on the car in the shortest period of time, realizing, however, that this is like taking apart an aircraft carrier and putting it back together in a few hours, not just taking apart a toy that I got for Christmas.”
The SWAT team started with a series of mini shutdown drills and began developing a preliminary emergency system before finally deciding to “pull the plug and see what happened.”
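As a rough illustration of a checklist with time standards, the Python sketch below records a target and an actual duration for each task and flags the ones that ran over. The task names, durations, and the `DrillTask` structure are hypothetical and are not taken from Facebook's tooling.

```python
from dataclasses import dataclass


@dataclass
class DrillTask:
    """One checklist item in a drill (hypothetical structure)."""
    name: str
    automated: bool        # True if a tool runs it, False if a person does
    target_seconds: float  # the time standard set for this task
    actual_seconds: float  # how long it took during the drill


def over_budget(tasks):
    """Return the names of tasks that exceeded their time standard."""
    return [t.name for t in tasks if t.actual_seconds > t.target_seconds]


# Illustrative values only.
tasks = [
    DrillTask("drain web tier", automated=True, target_seconds=300, actual_seconds=280),
    DrillTask("fail over cache", automated=True, target_seconds=120, actual_seconds=200),
    DrillTask("confirm dashboards", automated=False, target_seconds=60, actual_seconds=45),
]

late = over_budget(tasks)
print(late)
```

A report like this gives each drill a concrete list of steps to tighten before the next one.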
“To be honest, things didn’t go all that well the first few times we did this,” Parikh said. “But because we built a lot of instrumentation, tooling, and preparedness ahead of time, (end users) in the community didn’t notice what happened. We learned a lot and this was exactly the goal of the drill. We wanted to force ourselves to look at what would work and what didn’t work with this massive type of drill that we did.”
The major lesson learned: traffic management and load balancing are really hard. During the initial drills, “all hell broke loose,” Parikh said, when the team began draining traffic from a large set of software systems.
The group began running a number of tests and fine-tuning mechanisms for shifting traffic should a data center drop from the network, Parikh reported.
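To make the idea of shifting traffic away from a lost data center concrete, here is a minimal Python sketch of one possible policy: spread the drained site's load across the remaining sites in proportion to their spare capacity. The policy, the site names, and the numbers are assumptions chosen for illustration; the talk as reported here does not say how Facebook's mechanisms actually work.

```python
def drain(load, capacity, target):
    """Move all load off `target`, splitting it across the other sites
    in proportion to each site's spare capacity (assumed policy).

    `load` and `capacity` map site name -> traffic in arbitrary units.
    Returns a new dict; the inputs are not modified.
    """
    if target not in load:
        raise KeyError(f"unknown site: {target}")

    # Spare room at every site except the one being drained.
    spare = {
        site: max(capacity[site] - load[site], 0.0)
        for site in load
        if site != target
    }
    total_spare = sum(spare.values())
    displaced = load[target]

    if displaced > total_spare:
        raise RuntimeError("remaining sites cannot absorb the drained traffic")

    new_load = {target: 0.0}
    for site, room in spare.items():
        share = displaced * room / total_spare if total_spare else 0.0
        new_load[site] = load[site] + share
    return new_load


# Hypothetical three-site example.
load = {"site_a": 40.0, "site_b": 30.0, "site_c": 30.0}
capacity = {"site_a": 60.0, "site_b": 60.0, "site_c": 50.0}

after = drain(load, capacity, "site_a")
print(after)
```

Even in this toy version, the drain fails outright if the surviving sites lack headroom, which hints at why the real exercise depends on built-up redundancy.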
In 2014, Parikh decided Project Storm was ready for a real-world test: during a normal working day, the team would take down an actual data center and see whether they could orchestrate the traffic shift smoothly.
Parikh recalls that other Facebook leaders didn’t think he would actually do it. “I was having coffee with a colleague just before the first drill. He said, ‘You’re not going to go through with it; you’ve done all the prep work, so you’re done, right?’ I told him, ‘There’s only one way to find out’” whether it works.
That first takedown, which involved almost the entire engineering team and a lot of people from the rest of the company, turned out to be a little messy, at least from the inside. However, users didn’t appear to notice. Parikh presented a chart tracking the traffic loads on various software systems, something that should have displayed smooth curves but did not.
“This is an awful user experience when this happens and you don’t have a good control system,” he said, referring to the chart. “If you’re an engineer and you see a graph like this, three things come to mind. One: either you have bad data and you should go fix it; two: you have a control system that’s not working and you should go fix it; or three: you have no idea what you’re doing, and you probably should go fix that.”
“This is what we got – much, much better,” Parikh said of a second chart. “You can see drain on the left side and all that traffic is automatically absorbed, it kind of goes on the other services, the other capacity picks it up, you see there’s low variance here, it kind of looks pretty boring, it looks pretty nice. So we strive to make our graphs for our drill exercises to look like this graph.”
The live takedowns continue today, with Project Storm team members coming up with sillier and wilder goals for just what to take offline, Parikh says. “You need to push yourself to an uncomfortable place to get better.”