Facebook Engineers Crash Data Centers in Real-World Stress Tests
Jay Parikh, Facebook's vice president of engineering, says, "It's easier to take a data center down than to put it back together." Thanks to regular takedowns of Facebook's data centers intended to stress test the company's disaster recovery efforts, its software engineers are getting better at the putting-it-back-together part.
Parikh described the effort, dubbed "Project Storm," to an audience of invited engineers at the third annual @Scale conference held in San Jose this week. @Scale brings together engineers who build and maintain systems designed for huge numbers of users, from companies including Google, Netflix, Airbnb, Spotify, and Dropbox.
Facebook's wake-up call about how a natural disaster could knock it offline was the mammoth 2012 storm, Superstorm Sandy, which pushed its way up from the Caribbean and made landfall near New York City, wreaking havoc on the East Coast's Internet infrastructure. The storm endangered two of Facebook's data centers, each carrying tens of terabits of traffic. While many companies were disrupted for days or weeks, Parikh said both of Facebook's centers got through Sandy unscathed, but barely. The experience, he said, made the company realize it might not be as fortunate the next time.
"We had built up enough redundancy over the years that we weathered the storm, unintended," he said. "We really came pretty close for us. While we got through this and we didn't see any major disruptions in our service, we asked ourselves: what would happen if we lost a data center region or a data center due to something like this storm?"
The scale at which Facebook operates apparently compounds its resiliency challenges.
"We all care about scale," he told the @Scale audience. "We're obsessed with running things at high volume: lots of customers, lots of people dependent on our applications and services. And we're solving unprecedented problems; these are problems that are not being solved anywhere in the industry, generally speaking. So every day is chock full of a lot of scalability problems."
"Instead of just kind of wondering and assuming we'd probably be OK," Parikh said, the company in 2014 created a SWAT team, Project Storm, comprising the leaders of Facebook's various technology groups, who in turn marshaled the entire engineering workforce to figure out the answers.
The group developed tools and checklists of tasks, both manual and automated, and set time standards for completing each one. We wanted, Parikh said, "to run like a pit stop at a race; to get everything fixed on the car in the shortest period of time, realizing, however, that this is like taking apart an aircraft carrier and putting it back together in a few hours, not just taking apart a toy that I got for Christmas."
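To make the pit-stop idea concrete, here is a minimal, hypothetical sketch in Python of what a recovery checklist with per-task time budgets might look like. The task names, budgets, and structure are assumptions for illustration only and are not drawn from Facebook's actual tooling.

```python
from dataclasses import dataclass
import time

# Hypothetical recovery runbook: each task carries a time budget (a "time
# standard"), and the drill flags any task that runs over. Names and budgets
# are invented for this example.

@dataclass
class Task:
    name: str
    budget_s: float          # time standard the team commits to
    automated: bool = True   # manual tasks are tracked the same way

RUNBOOK = [
    Task("redirect edge traffic away from region", 120),
    Task("drain stateful services", 300),
    Task("verify replicas are serving reads", 180),
    Task("page on-call owners for manual checks", 60, automated=False),
]

def run_drill(tasks, execute):
    """Run each task, timing it against its budget and flagging overruns."""
    for task in tasks:
        start = time.monotonic()
        execute(task)                       # perform or simulate the task
        elapsed = time.monotonic() - start
        status = "OK" if elapsed <= task.budget_s else "OVER BUDGET"
        print(f"{task.name}: {elapsed:.1f}s / {task.budget_s}s [{status}]")

if __name__ == "__main__":
    # Simulated executor: a real drill would call out to automation or
    # wait for a human to confirm the manual steps.
    run_drill(RUNBOOK, execute=lambda task: time.sleep(0.01))
```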
The SWAT team started with a series of mini shutdown drills and began developing a preliminary emergency system before finally deciding to "pull the plug and see what happened."
"To be honest, things didn't go all that well the first few times we did this," Parikh said. "But because we built a lot of instrumentation, tooling, and preparedness ahead of time, [end users] in the community didn't notice what happened. We learned a lot, and this was exactly the goal of the drill. We wanted to force ourselves to look at what would work and what didn't work with this massive type of drill that we did."
The major lesson learned: traffic management and load balancing are really hard. During the initial drills, "all hell broke loose," Parikh said, when the team began draining traffic from a large set of software systems.
The group ran a series of tests and fine-tuned its mechanisms for shifting traffic should a data center drop from the network, Parikh reported.
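As a rough illustration of what shifting traffic away from a region can mean at the load-balancer level, the sketch below drains one region by gradually ramping its routing weight down to zero while the remaining regions absorb its share. The region names, weights, and drain schedule are invented for the example; the talk does not describe Facebook's real systems at this level of detail.

```python
import random
import time

# Hypothetical weighted-routing model: traffic is distributed according to
# per-region weights, and a "drain" shifts one region's weight onto the
# others in small increments so they warm up gradually.

REGION_WEIGHTS = {"us-east": 40, "us-west": 35, "eu-central": 25}

def pick_region(weights):
    """Weighted random choice, standing in for the routing layer."""
    regions, w = zip(*weights.items())
    return random.choices(regions, weights=w, k=1)[0]

def drain_region(weights, target, steps=5, pause_s=1.0):
    """Ramp the target region's share down to zero over several steps,
    redistributing it evenly across the remaining regions."""
    per_step = weights[target] / steps
    others = [r for r in weights if r != target]
    for _ in range(steps):
        shed = min(per_step, weights[target])
        weights[target] -= shed
        for r in others:
            weights[r] += shed / len(others)
        print({r: round(w, 1) for r, w in weights.items()})
        time.sleep(pause_s)
    weights[target] = 0.0
    return weights

if __name__ == "__main__":
    drain_region(REGION_WEIGHTS, "us-east")
    # After the drain, pick_region() never returns "us-east".
    print(pick_region(REGION_WEIGHTS))
```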
In 2014, Parikh decided Project Storm was ready for a real-world test: during a normal working day, the team would take down an actual data center and see whether it could manage the traffic shift smoothly.
Parikh recalls that other Facebook leaders didn't think he would actually do it. "I was having coffee with a colleague just before the first drill. He said, 'You're not going to go through with it; you've done all the prep work, so you're done, right?' I told him, 'There's only one way to find out' if it works."
That first takedown, which involved almost the entire engineering team and a lot of people from the rest of the company, turned out to be a little messy, at least from the inside. Users, however, didn't appear to notice. Parikh presented a chart tracking the traffic loads on various software systems, something that should have displayed smooth curves.
"This is an awful user experience when this happens and you don't have a good control system," he said in reference to that chart. "If you're an engineer and you see a graph like this, three things come to mind. One: either you have bad data and you should go fix it; two: you have a control system that's not working and you should go fix it; or three: you have no idea what you're doing, and you probably should go fix that."
"This is what we got: much, much better," Parikh said of the results after further drills. "You can see the drain on the left side, and all that traffic is automatically absorbed; it kind of goes on to the other services, the other capacity picks it up. You see there's low variance here; it kind of looks pretty boring, it looks pretty nice. So we strive to make our graphs for our drill exercises look like this graph."
The live takedowns continue today, with the Project Storm team members coming up with ever wilder targets for just what to take offline, Parikh says. "You need to push yourself to an uncomfortable place to get better."
Source: IEEE