Chaos Engineering at Netflix – What It Is & How It Works

Elixirr Digital

7 years ago

It has been a while now since Netflix first announced a brand new infrastructure tool to cope with their complex technology stack. Since the original announcement, six to seven years ago, they’ve further developed the tool – expanding its purpose and capabilities.

For those who haven’t heard of it before, they’ve named it The Simian Army. The idea alone is enough to make any developer sweat nervously. The quirkily named toolset was built not only to stress test their servers, but also to test, on a daily basis, how their engineers respond when a system failure occurs.

What started off as a crazy idea called Chaos Monkey was expanded further to incorporate a new animal-themed testing tool called Chaos Kong, as well as other similarly named tools. The idea behind the tools is that outages on major websites happen all the time, so engineers need to be aware of that and able to cope in a stressful situation where services may be scarce or even unavailable. This is what is known as Chaos Engineering.

Netflix itself is a worldwide service used by millions and millions of people every day. Those people aren’t just browsing, they’re streaming large amounts of media content for hours at a time. Many developers, system engineers and administrators have to think about availability of services all the time, and at Coast Digital we’re always thinking of ways to speed up websites and ensure they stay up, even through busy periods.

For a giant corporation like Netflix, this problem is unbelievably magnified.

The Army of Monkeys

So, what does Chaos Monkey actually do? The original monkey was designed to take down operational servers in a production environment at random throughout the working day. Yes, that’s right. Production servers are purposefully taken offline and stress tested, constantly putting a hindrance on the service day in and day out.

“Since we knew that server failures are guaranteed to happen, we wanted those failures to happen during business hours when we were on hand to fix any fallout.”

– Blog Post: Chaos Engineering Updated

Now it may seem crazy, stupid in fact. But the engineers at Netflix argued (quite rightly too) that it shouldn’t matter when a server is knocked offline. You should be able to deal with an issue like this because sometimes it really does happen.
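
To make the idea concrete, here is a minimal sketch of the selection step only. It is not Netflix’s implementation: the function names, the group layout, the 9-to-3 weekday window, the termination probability and the rule of sparing a group’s last instance are all assumptions made for illustration. It does follow the quoted point that failures should happen during business hours.

```python
import random
from datetime import datetime


def in_business_hours(now, start_hour=9, end_hour=15):
    """True on weekdays (Mon-Fri) between start_hour and end_hour.

    The window itself is an assumption for this sketch.
    """
    return now.weekday() < 5 and start_hour <= now.hour < end_hour


def pick_victims(groups, now, probability=0.2, rng=None):
    """Choose at most one instance per group to terminate.

    groups: dict mapping a (hypothetical) group name to a list of instance ids.
    Returns a dict of group name -> chosen instance id.
    """
    if rng is None:
        rng = random.Random()
    if not in_business_hours(now):
        # Nobody is on hand to fix the fallout, so do nothing.
        return {}
    victims = {}
    for name, instances in groups.items():
        # Safety assumption: never take the last remaining instance of a group.
        if len(instances) > 1 and rng.random() < probability:
            victims[name] = rng.choice(instances)
    return victims


if __name__ == "__main__":
    groups = {"api": ["i-1", "i-2", "i-3"], "billing": ["i-9"]}
    print(pick_victims(groups, datetime.now()))
```

The sketch only decides what to take down. Actually terminating a server would be a call to whatever your hosting provider offers, and is deliberately left out.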

A number of years later, they went one step further and created Chaos Kong. Whilst the smaller, less aggressive Chaos Monkey would be taking servers offline, Chaos Kong is a bit bigger and much angrier.

So, whilst an individual server in one region (for example the West Coast of the United States) going offline isn’t too much to worry about, having an entire region within a service going offline is definitely a concern.

Taking an entire region offline is a completely different and incredibly dangerous thing to do. Chaos Kong does just that. It will randomly take an entire region offline.

For argument’s sake, say the U.S. West Coast region went down. You would need to mitigate the outage by redirecting all service traffic elsewhere, either to the East Coast or to European regions, whichever is more operationally beneficial.
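
As a rough illustration of that redirection, the sketch below redistributes a failed region’s share of traffic across the regions that are still healthy. The region names, the weights and the proportional split are assumptions for the example; a real failover would also have to consider capacity and latency, which is what “operationally beneficial” covers.

```python
def reroute(weights, failed_region):
    """Return new traffic weights with the failed region removed.

    weights: dict mapping region name -> share of traffic (summing to 1.0).
    The failed region's share is spread over the healthy regions in
    proportion to the traffic they already carry.
    """
    if failed_region not in weights:
        raise KeyError(failed_region)
    healthy = {r: w for r, w in weights.items() if r != failed_region}
    total = sum(healthy.values())
    if total == 0:
        raise RuntimeError("no healthy region left to absorb traffic")
    # Rescale so the remaining shares sum to 1.0 again.
    return {region: share / total for region, share in healthy.items()}


if __name__ == "__main__":
    # Hypothetical regions and traffic shares.
    before = {"us-west": 0.5, "us-east": 0.25, "eu-west": 0.25}
    print(reroute(before, "us-west"))
```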

Testing Your Systems

As you may know, testing your website is an absolute must nowadays. With high-speed internet connections and people accessing data all the time, it’s important to make sure that your service is always available, and responds quickly, to the people who want to use it. With this in mind, you need to take certain steps and plan ahead in case anything disastrous happens to your website.

“This was our philosophy when we built Chaos Monkey, a tool that randomly disables our production instances to make sure we can survive this common type of failure without any customer impact.”

– Blog Post: The Netflix Simian Army

Keeping Your System Operational

Before we put a website live at Coast Digital, we always test it to make sure it performs properly and as the end-user expects it to. But once the website has gone live, the testing and monitoring don’t stop there. Maintaining service uptime requires a huge amount of effort and attention to detail. Spotting spikes in traffic and usage patterns can help to reduce the risk of downtime and keep customers happy.
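
One simple way to spot a spike is to compare each new traffic sample with the recent average. The sketch below is only an illustration of that idea: the window size, the threshold and the sample figures are assumptions, and real monitoring tools use more robust methods.

```python
from statistics import mean, stdev


def find_spikes(requests_per_minute, window=5, threshold=3.0):
    """Return the indices of samples that look like traffic spikes.

    A sample counts as a spike when it exceeds the mean of the previous
    `window` samples by more than `threshold` standard deviations.
    """
    if window < 2:
        raise ValueError("window must be at least 2")
    spikes = []
    for i in range(window, len(requests_per_minute)):
        history = requests_per_minute[i - window:i]
        # Floor the deviation at 1.0 so a perfectly flat history
        # does not flag every tiny change as a spike.
        limit = mean(history) + threshold * max(stdev(history), 1.0)
        if requests_per_minute[i] > limit:
            spikes.append(i)
    return spikes


if __name__ == "__main__":
    # Hypothetical requests-per-minute figures.
    samples = [100, 102, 98, 101, 99, 100, 400, 101]
    print(find_spikes(samples))
```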

Whilst the Netflix approach isn’t suitable for every business, it is definitely an interesting approach to consider when working at a large international scale.

Websites can go offline at any time, not just accidentally but through malicious action too. A method like this not only trains engineers to deal with such issues regularly, it can also strengthen resilience against Distributed Denial of Service attacks, which in today’s digital age are all too common.

A technique like this is a win-win: you protect yourself against attacks while also training employees to react in an operational crisis.

What do you think of Netflix’s approach to service failure? Is this something that could be deployed to your infrastructure too? If so, what are the benefits that you would expect to see? Let us know what you think about Chaos Engineering!