Netflix’s director of cloud solutions, Ariel Tseitlin, recently sat down at the annual CloudBeat conference and detailed how his company keeps your favorite movies and TV shows streaming seamlessly into your living room. His secret is the Netflix chaos monkey. Seriously.
Every weekday between 9am and 5pm, an army of malicious programs, affectionately known as “chaos monkeys,” is unleashed upon Netflix’s information infrastructure. “Their sole purpose is to make sure that we’re failing in a consistent and frequent enough way to make sure that we don’t drift into overall failure,” Tseitlin said. The goal is to fail often and uncover potential problems before they become actual problems.
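To make the schedule concrete, here is a minimal sketch of what one such program might look like. It is not Netflix’s actual code: the function names, the list of instance names, and the in-memory “termination” are all assumptions made for illustration. The only details taken from the description above are the weekday 9am-to-5pm window and the idea of deliberately killing something that is running.

```python
import random
from datetime import datetime
from typing import List, Optional


def in_chaos_window(now: datetime) -> bool:
    """Return True on weekdays (Monday-Friday) from 9:00 up to, but not including, 17:00."""
    return now.weekday() < 5 and 9 <= now.hour < 17


def unleash_monkey(instances: List[str], now: datetime,
                   rng: random.Random) -> Optional[str]:
    """Hypothetical chaos monkey: remove one random instance during the window.

    Returns the name of the terminated instance, or None if nothing was done.
    A real tool would call a cloud provider's API; here we only edit a list.
    """
    if not instances or not in_chaos_window(now):
        return None
    victim = rng.choice(instances)
    instances.remove(victim)
    return victim


if __name__ == "__main__":
    fleet = ["api-1", "api-2", "stream-1"]  # made-up instance names
    killed = unleash_monkey(fleet, datetime.now(), random.Random())
    print("terminated:", killed, "| remaining:", fleet)
```

Restricting the monkey to business hours means that when something does break, engineers are at their desks to see it and fix it.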
This method of “planned” outages and service troubles has created an extremely talented and flexible team, as well as a robust back end capable of serving up one-third of peak internet traffic without interruption. Indeed, by routinely unleashing chaos monkey programs, Netflix has been able to create one of the strongest infrastructures on the web.
“The design premise there is that all of the architecture is resilient enough to retry and to begin re-serving the experience in a way that is completely transparent to the customer,” Tseitlin said. “You as the viewer should have no idea that the instance that was serving up your movie was just terminated.”
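The retry idea Tseitlin describes can be sketched in a few lines. Again, this is an illustration under stated assumptions rather than Netflix’s implementation: the `Instance` class, the `InstanceTerminated` exception, and `serve_with_retry` are hypothetical names, and a real system would retry over a network rather than over objects in memory.

```python
from typing import List


class InstanceTerminated(Exception):
    """Raised when a request reaches an instance that has been killed."""


class Instance:
    """Hypothetical stand-in for a server that streams video chunks."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.alive = True

    def serve(self, chunk: str) -> str:
        if not self.alive:
            raise InstanceTerminated(self.name)
        return f"{chunk} from {self.name}"


def serve_with_retry(instances: List[Instance], chunk: str) -> str:
    """Try each instance in order and return the first successful response.

    The caller never sees which instances failed along the way; it only
    gets an error if every instance is down.
    """
    for instance in instances:
        try:
            return instance.serve(chunk)
        except InstanceTerminated:
            continue  # quietly move on to the next instance
    raise RuntimeError("no healthy instance available")


if __name__ == "__main__":
    a, b = Instance("a"), Instance("b")
    a.alive = False  # the monkey just terminated instance "a"
    print(serve_with_retry([a, b], "chunk-1"))
```

From the viewer’s side, the request simply succeeds; the terminated instance shows up only inside the retry loop.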
By creating an internal expectation that their services are constantly under attack, Netflix has crafted a culture that actively seeks out and solves problems rather than waiting for them to manifest themselves in expensive outages. One has to wonder what other firms and industries could benefit from their own chaos monkey.