Microsoft says Nov. 18 Azure outage due to misconfigured code
Microsoft Azure, the company's cloud platform for businesses, suffered a major outage on November 18 that left many users in the lurch. In a statement, Microsoft said the outage was caused by its developers deploying bad code.
Microsoft developers, it seems, were working to iron out a bug in their software. The fix itself apparently caused the massive outage of Azure cloud services. Microsoft said that it had tested the update before rolling it out. But a controlled test environment cannot always accurately predict how a software update will behave on such a huge platform. Microsoft therefore follows a policy of deploying any new update section by section, something it terms "flighting," that is, limiting the rollout. This time, though, probably out of overeagerness, the developers deployed the full update package all at once, which caused a cascading effect across all its servers. In a statement issued on the Azure blog, Jason Zander, CVP, Azure Team, described the impact.
As a consequence, connection rates to Azure on Nov. 18 dropped from 97% to 7%-8% after 7 p.m. Eastern in Northern Virginia. The Azure data center in Dallas suffered a complete outage for a short while. Data centers in Europe didn’t recover until deep into the following day.
He added that although the team has a standard deployment policy for patching bugs, there was an apparent miscommunication. “The standard flighting deployment policy of incrementally deploying changes across small slices was not followed,” wrote Zander. Zander said that the team had identified the key problem, which was a configuration issue in the Azure Table storage front-ends. “The configuration switch was incorrectly enabled for Azure Blob storage front-ends,” he wrote.
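To make the idea of flighting concrete, here is a minimal Python sketch of an incremental rollout. It is an illustration only, not Microsoft's tooling: the function `flighted_rollout`, the node names and the health check are all hypothetical. The update goes to one small slice at a time, and the rollout stops as soon as a slice turns out to be unhealthy.

```python
from typing import Callable, Iterable, List


def flighted_rollout(
    slices: Iterable[List[str]],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Deploy slice by slice; stop after the first slice with an unhealthy node.

    Returns the nodes that received the update, in deployment order.
    """
    updated: List[str] = []
    for slice_nodes in slices:
        for node in slice_nodes:
            deploy(node)
            updated.append(node)
        # Verify the whole slice before touching the next one.
        if not all(is_healthy(node) for node in slice_nodes):
            break
    return updated


# Hypothetical scenario: the update breaks every Blob front-end it touches.
slices = [["table-fe-1"], ["blob-fe-1", "blob-fe-2"], ["blob-fe-3", "blob-fe-4"]]
deployed = set()
touched = flighted_rollout(
    slices,
    deploy=deployed.add,
    is_healthy=lambda node: not node.startswith("blob"),
)
print(touched)  # ['table-fe-1', 'blob-fe-1', 'blob-fe-2']
```

In this toy run the damage is limited to the first Blob slice, and the remaining front-ends never receive the faulty update. That containment is what is lost when an update is pushed everywhere at once.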
Table storage front-ends record the sequence of the different data types going into a Blob (a service for storing large amounts of unstructured data) and can be used to guide the data’s retrieval. The error in the configuration switch appears to have caused an infinite loop, which ultimately resulted in the outage of the Azure cloud service.
The original update was meant to patch some bugs found by the Azure team and improve the performance of the cloud platform. The update passed every test in the alpha testing phase. The successful results of the alpha testing probably overexcited the developers, who decided to forgo the flighting method of deployment and implemented the update all at once. The result, as seen on Nov. 18, was an all-out outage that caused problems for users. In response, Azure administrators have now implemented an automated update practice that is meant to keep such an event from occurring again.
In perhaps the clearest outcome of the incident, Zander wrote: “Microsoft Azure had clear operating guidelines, but there was a gap in the deployment tooling that relied on human decisions … With the tooling updates, the policy is now enforced by the deployment platform itself.”
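What "enforced by the deployment platform itself" could look like is sketched below in Python. Everything in it is an assumption made for illustration: the `validate_plan` function, the two-stage minimum and the 5% cap on the first stage are hypothetical and not taken from Zander's post. The point is only that the tool rejects a non-compliant plan instead of relying on an operator to remember the policy.

```python
from typing import List


class PolicyViolation(Exception):
    """Raised when a rollout plan breaks the staged-deployment policy."""


def validate_plan(stages: List[List[str]], fleet_size: int,
                  max_first_percent: int = 5) -> None:
    """Reject plans that are not staged (thresholds are illustrative)."""
    if len(stages) < 2:
        raise PolicyViolation("rollout must have at least two stages")
    # Integer arithmetic avoids floating-point surprises at the boundary.
    if len(stages[0]) * 100 > max_first_percent * fleet_size:
        raise PolicyViolation("first stage exceeds the allowed share of the fleet")


def plan_allowed(stages: List[List[str]], fleet_size: int) -> bool:
    """Return True if the plan passes the policy check, False otherwise."""
    try:
        validate_plan(stages, fleet_size)
        return True
    except PolicyViolation:
        return False


# Hypothetical fleet of 100 front-ends.
fleet = [f"fe-{i}" for i in range(100)]
all_at_once = [fleet]
staged = [fleet[:5], fleet[5:50], fleet[50:]]

print(plan_allowed(all_at_once, len(fleet)))  # False
print(plan_allowed(staged, len(fleet)))       # True
```

Under these assumed rules, a plan like the all-at-once one would be refused before a single server is touched.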
Zander acknowledged that cloud operations must become more reliable and said Microsoft will continue to work on that goal. “We sincerely apologize and recognize the significant impact this service interruption may have had on your applications and services,” he wrote.
Zander may have sincerely apologised for the overexcitement of his dev team, but the fact remains that Microsoft is pushing patches and updates in a hurry without the necessary testing. This issue has already led Microsoft to release buggy patches that trigger BSODs in its Patch Tuesday updates twice, once in October with KB 2949927 and once in December with KB 3004394, only to remove them later. Users will hope that Microsoft comes out with a standard operating procedure (SOP) for releasing only those updates and patches that have been verified in a real-world working environment.