Netflix’s Simian Army
The big question on your mind might be this: What happens if the Amazon cloud fails?
That’s one reason it took Netflix seven years to make the shift to Amazon. Instead of moving existing systems intact to the cloud, Netflix rebuilt nearly all of its software to take advantage of a cloud network that “allows one to build highly reliable services out of fundamentally unreliable but redundant components,” the company says. To minimize the risk of disruption, Netflix has built a series of tools with names like “Chaos Monkey,” which randomly takes virtual machines offline to make sure Netflix can survive failures without harming customers. Netflix’s “Simian Army” ramped up with Chaos Gorilla (which disables an entire Amazon availability zone) and Chaos Kong (which simulates an outage affecting an entire Amazon region and shifts workloads to other regions).
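To make the idea concrete, here is a toy simulation of the kind of drill Chaos Monkey performs. It is only a sketch under simple assumptions, not Netflix's actual tool: the instance names, the `chaos_monkey` and `run_drill` functions, and the rule that the service is "available" as long as one instance is up are all hypothetical.

```python
import random


def chaos_monkey(instances, rng):
    """Take one randomly chosen running instance offline; return its name."""
    running = [name for name, up in instances.items() if up]
    if not running:
        return None  # nothing left to kill
    victim = rng.choice(running)
    instances[victim] = False
    return victim


def service_available(instances):
    """Assumption: the service survives while at least one instance is up."""
    return any(instances.values())


def run_drill(instance_count=6, kills=3, seed=0):
    """Run a drill: kill up to `kills` instances, then report the outcome."""
    rng = random.Random(seed)  # seeded so the drill is repeatable
    instances = {f"vm-{i}": True for i in range(instance_count)}
    killed = []
    for _ in range(kills):
        victim = chaos_monkey(instances, rng)
        if victim is not None:
            killed.append(victim)
    return killed, service_available(instances)


if __name__ == "__main__":
    killed, ok = run_drill()
    print("killed:", killed, "service still up:", ok)
```

The point of the drill is the final check: if redundant instances are doing their job, losing a few at random should not take the service down.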
Amazon’s cloud network is spread across 12 regions worldwide, each of which has availability zones consisting of one or more data centers. Netflix operates primarily in the Northern Virginia, Oregon, and Dublin regions, but if an entire region goes down, “we can instantaneously redirect the traffic to the other available ones,” Izrailevsky said. “It’s not that uncommon for us to fail over across regions for various reasons.”
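A regional failover of the sort Izrailevsky describes can be sketched as moving a failed region's share of traffic onto the survivors. The example below is illustrative only: the traffic percentages are made up, the `evacuate` function is hypothetical, and the proportional split is an assumption rather than Netflix's routing policy. The region identifiers are the AWS names for Northern Virginia, Oregon, and Dublin.

```python
def evacuate(traffic, failed):
    """Return a new traffic map with the failed region's share moved
    to the remaining regions in proportion to their current load."""
    survivors = {r: t for r, t in traffic.items() if r != failed}
    total = sum(survivors.values())
    if total <= 0:
        raise RuntimeError("no surviving region can absorb the traffic")
    displaced = traffic.get(failed, 0.0)
    shifted = {r: t + displaced * t / total for r, t in survivors.items()}
    shifted[failed] = 0.0  # the failed region serves nothing
    return shifted


if __name__ == "__main__":
    # Hypothetical traffic split, in percent of total streaming load.
    traffic = {"us-east-1": 50.0, "us-west-2": 30.0, "eu-west-1": 20.0}
    print(evacuate(traffic, "us-east-1"))
```

In this toy model the total load is unchanged after the shift, which is why the surviving regions need spare capacity before a drill like Chaos Kong can succeed.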
Years ago, Netflix wasn’t able to do that, and the company suffered a streaming failure on Christmas Eve in 2012, when it was operating in just one Amazon region. “We’ve invested a lot of effort in disaster recovery and making sure no matter how big a failure that we’re able to bring things back from backups,” he said.
Netflix has multiple backups of all data within Amazon.
“Customer data or production data of any sort, we put it in distributed databases such as Cassandra, where each data element is replicated multiple times in production, and then we generate primary backups of all the data into S3 [Amazon’s Simple Storage Service],” he said. “All the logical errors, operator errors, or software bugs, many kinds of corruptions—we would be able to deal with them just from those S3 backups.”
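The distinction between replication and backups matters here, and a small model shows why. In the sketch below, the `ReplicatedStore` class is a hypothetical stand-in (it is not Cassandra's or S3's API): replication lets reads survive the loss of a copy, but a bad write is faithfully copied to every replica, so only a separate snapshot can undo it.

```python
import copy


class ReplicatedStore:
    """Toy key-value store that writes every value to all live replicas."""

    def __init__(self, replicas=3):
        self.replicas = [dict() for _ in range(replicas)]

    def _live(self):
        return [r for r in self.replicas if r is not None]

    def write(self, key, value):
        for replica in self._live():
            replica[key] = value

    def read(self, key):
        live = self._live()
        if not live:
            raise RuntimeError("all replicas lost")
        return live[0][key]

    def lose_replica(self, index):
        """Simulate a hardware failure that destroys one copy."""
        self.replicas[index] = None

    def snapshot(self):
        """Return an independent backup copy (stand-in for an S3 backup)."""
        return copy.deepcopy(self._live()[0])

    def restore(self, backup):
        """Rebuild every replica from the backup."""
        self.replicas = [copy.deepcopy(backup) for _ in self.replicas]


if __name__ == "__main__":
    store = ReplicatedStore()
    store.write("plan", "premium")
    backup = store.snapshot()
    store.lose_replica(0)             # replication covers this
    store.write("plan", "corrupted")  # a software bug hits all replicas
    store.restore(backup)             # only the backup undoes it
    print(store.read("plan"))
```

This is the class of failure Izrailevsky is pointing at: logical errors and software bugs propagate through replication, so they have to be handled from backups.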
What if all of Netflix’s systems in Amazon went down? Netflix keeps backups of everything in Google Cloud Storage in case of a natural disaster, a self-inflicted failure that somehow takes all of Netflix’s systems down, or a “catastrophic security breach that might affect our entire AWS deployment,” Izrailevsky said. “We’ve never seen a situation like this and we hope we never will.”
But Netflix would be ready in part thanks to a system it calls “Armageddon Monkey,” which simulates failure of all of Netflix’s systems on Amazon. It could take hours or even a few days to recover from an Amazon-wide failure, but Netflix says it can do it. Netflix pointed out that Amazon isolates its regions from each other, making it difficult for all of them to go out simultaneously.
“So that’s not the scenario we’re planning for. Rather it’s a catastrophic bug or data corruption that would cause us to wipe the slate clean and start fresh from the latest good back-up,” a Netflix spokesperson said. “We hope we will never need to rely on Armageddon Monkey in real life, but going through the drill helps us ensure we back up all of our production data, manage dependencies properly, and have a clean, modular architecture; all this puts us in a better position to deal with smaller outages as well.”
Netflix declined to say where it would operate its systems during an emergency that forced it to move off Amazon. “From a security perspective, it’d be better not to say,” a spokesperson said.
Netflix has released much of its software as open source, saying it would rather collaborate with other companies than keep secret its methods for making cloud networks more reliable. “While of course cloud is important for us, we’re not very protective of the technology and the best practices, we really hope to build the community,” Izrailevsky said.