Chaos Engineering ideas are gathering steam these days. In August we ran an engineering-wide Game Day here at Quid, where we simulated real-world service failures and had “on-call teams” exercise their incident response skills. It was great fun and good training for Quid engineers. This post covers the organizational side of Game Day, lessons learned, and some technical tricks that others can use in similar exercises.
How do you do Chaos Engineering without, well, chaos?
Followers of Chaos Principles intentionally inject failures in production to test for reliability and recovery from faults. At Quid, we target enterprise clients with high-value use cases, and therefore steer clear of production tests that would put even a single user at risk of a poor experience. Instead, we apply Chaos Engineering in the staging environment.
This approach permitted us to take the system beyond its redundancy limits and generate failures that were highly visible, so we could train engineers in incident response on a realistic scenario where users would be complaining. We didn’t simply want to see if Quid can tolerate certain types of failure with barely a hiccup; we wished to create teachable moments.
How we ran Game Day
“We” here refers to the Quality Engineering team at Quid, who organized this event. We care about such things as tests, bugs, failures, successes, regression coverage to keep bugs from escaping, escalation engineering for the ones that got away, alerts to let us know when things go wrong, and metrics to track it all. We advertised the idea of Game Day as a training ground for all Quid engineers to debug production issues and build up their incident response muscle, got organizational buy-in, planned the activities, and orchestrated the event. We did it, and so can you!
Game Day was planned as a nearly full-day activity, with 3 “incidents” played out at 10am, 11am, and 1pm, followed by a team-wide postmortem after 2pm. We canceled regular standups and other meetings to dedicate time to Game Day. Of course, real-world incidents come up during standups and meetings, as well as on nights and weekends, but for training purposes we figured a distraction-free workday was a good setup. Next time we’re doing it at 3am on Saturday, in production. Just kidding! Or are we?
We split our engineering team into groups of 2 or 3 engineers, with newbie engineers riding along for learning. We created a separate private Slack channel for each “on-call team” so that the teams could debug issues independently of each other, with a QE representative monitoring each channel and giving occasional hints if needed. We asked the teams not to apply a fix when they thought they had a solution, but to report in their channel what that fix would be, since other teams could still be in investigation mode.
What happens during a real incident
- notice that an outage is happening
- report on http://status.quid.com/
- communicate with your on-call teammates
- report all-clear on http://status.quid.com/
- write an incident report
- participate in postmortem discussion
- attend to followup action items
We had the same expectations for Game Day incidents! Folks got to hone not only their debugging chops, but communication with their teammates, crafting the words for status updates, authoring an incident report, and participating in a blameless postmortem using the “5 Whys” approach.
We test in production, but not for Game Day. When we plan to create outages on purpose, we do that in staging.
Our staging environment mimics production closely and is subjected to high load from real human users. Thanks to the crowdsourced testing approach using the RainforestQA service, we can throw as much load at staging as production sees at peak, and the tester crowd follows our automated scripts to use the app, resulting in very realistic usage patterns. We also have our data acquisition pipeline forked to feed staging the same data that goes into production daily. One difference is a smaller ElasticSearch cluster in staging, since it would be prohibitively expensive to keep a full copy of our production cluster, which houses an Internet’s worth of English-language news for the last 4 years. We have found that keeping the last 3-6 months of data is sufficient for producing production-like behaviors in the ES cluster.
Importantly, we have the exact same Datadog-based alerting set up for staging as for production, with the only difference being the alert destination. Our team lives in Slack, and we have separate Slack channels for staging alerts, which we were watching carefully for this exercise.
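To illustrate the "same alerts, different destination" idea, here is a minimal sketch, not Quid's actual tooling. It assumes a monitor is represented as a plain dictionary with a `message` field that carries a Slack notification handle in Datadog's `@slack-<channel>` style. The function name, the dictionary layout, and both channel names are hypothetical. It touches only the notification handle; how the monitor query gets scoped to staging hosts is out of scope here.

```python
import copy
from typing import Any, Dict


def retarget_monitor(monitor: Dict[str, Any], prod_handle: str, staging_handle: str) -> Dict[str, Any]:
    """Return a copy of a monitor definition that notifies a staging channel.

    Only the notification handle inside the message changes; every other
    field is copied untouched so the staging alert behaves like production.
    """
    if prod_handle not in monitor["message"]:
        raise ValueError(f"handle {prod_handle!r} not found in monitor message")
    staging = copy.deepcopy(monitor)
    staging["message"] = monitor["message"].replace(prod_handle, staging_handle)
    return staging


if __name__ == "__main__":
    # Hypothetical production monitor definition.
    prod = {
        "name": "Search hosts: CPU too high",
        "type": "metric alert",
        "query": "avg(last_5m):avg:system.cpu.user{role:search} by {host} > 90",
        "message": "CPU is pegged on {{host.name}}. @slack-alerts-prod",
    }
    print(retarget_monitor(prod, "@slack-alerts-prod", "@slack-alerts-staging")["message"])
```

Raising an error when the handle is missing is a deliberate choice: a staging monitor that silently keeps the production destination would page the wrong channel during an exercise.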
Chaos engineers commonly use Netflix’s Simian Army and other fault injection tools to introduce randomized faults in the environment, which work great if you stay within the system’s intended failure tolerance limits. We needed precise control to create visible incidents.
1st Incident: Resource Exhaustion
We first focused on a resource exhaustion scenario (e.g., CPU or memory) that would leave the system in a degraded state and impact the end-user experience. This would cause alerts to fire (preferable) or end users to complain. We used the stress-ng tool to consume 100% CPU on some ElasticSearch hosts, impacting our search service. The teams followed the trail of alerts to find the misbehaving hosts.
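The exact invocation used on Game Day is not recorded here. As a rough sketch, assuming the `--cpu`, `--cpu-load`, and `--timeout` options of stress-ng, a small helper can assemble a CPU-burning command with a built-in time limit. The helper name and its default values are hypothetical.

```python
import shlex
from typing import List


def build_stress_command(workers: int = 0, load_percent: int = 100, duration_s: int = 600) -> List[str]:
    """Build a stress-ng command line that loads the CPU for a bounded time.

    workers=0 asks stress-ng to start one CPU worker per online processor.
    """
    if workers < 0:
        raise ValueError("workers must be zero or positive")
    if not 0 <= load_percent <= 100:
        raise ValueError("load_percent must be between 0 and 100")
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return [
        "stress-ng",
        "--cpu", str(workers),
        "--cpu-load", str(load_percent),
        "--timeout", f"{duration_s}s",  # stress-ng stops on its own after this
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be reviewed first.
    print(shlex.join(build_stress_command(workers=0, load_percent=100, duration_s=600)))
```

Always passing a timeout means the fault clears itself even if the organizers get distracted, which is a sensible default for a training exercise.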
2nd Incident: Resource Flakiness
Flaky resource behavior is often hard to pinpoint, since only a fraction of calls may fail, so some users experience a specific issue while others don’t seem to be affected at all. We chose to simulate network flakiness by using the iptables tool to partially drop traffic.
This failure mode had some unintended consequences in our environment with service container restarts, making this incident extra fun to debug!
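One way to drop only part of the traffic with iptables is its `statistic` match in random mode. The sketch below is an assumption about how such a rule could be built, not the rule used on Game Day: the helper name is hypothetical, and the chain, protocol, and port (9200, the default Elasticsearch HTTP port) are illustrative choices.

```python
import shlex
from typing import List


def build_drop_rule(probability: float, port: int, chain: str = "INPUT", delete: bool = False) -> List[str]:
    """Build an iptables rule that randomly drops a fraction of TCP packets.

    With delete=True the same rule specification is emitted with -D, which
    removes the rule and restores normal traffic.
    """
    if not 0.0 < probability < 1.0:
        raise ValueError("probability must be strictly between 0 and 1")
    if not 1 <= port <= 65535:
        raise ValueError("port must be between 1 and 65535")
    action = "-D" if delete else "-A"
    return [
        "iptables", action, chain,
        "-p", "tcp", "--dport", str(port),
        "-m", "statistic", "--mode", "random",
        "--probability", f"{probability:.2f}",
        "-j", "DROP",
    ]


if __name__ == "__main__":
    # Print the inject and cleanup commands side by side for review.
    print(shlex.join(build_drop_rule(0.30, 9200)))
    print(shlex.join(build_drop_rule(0.30, 9200, delete=True)))
```

Generating the cleanup command from the same function keeps the two rule specifications identical, which matters because iptables only deletes a rule whose specification matches exactly.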
3rd Incident: Service Failure (ElasticSearch)
For our 3rd incident, we injected a fault into ElasticSearch by running a very heavy query with an aggregation on a high-cardinality field directly against the ES API, which in our setup takes down some ES nodes or even the entire cluster. We feel safe in posting about this publicly because Quid does not expose the ES API to the world; Quid software generates well-formed queries on the user’s behalf via our search DSL, and would not allow a query like this to proceed. However, people who run ES as an open installation should beware!
We found a few discussions around this where folks ask how to prevent an ES cluster from crashing on deep aggregations, with the advice being basically to avoid those and go breadth-first instead. However, if a clueless or malicious actor did launch a query like that against your ES, wouldn’t you want to be able to terminate it? It looks like you currently can’t, and it is therefore impossible to prevent ES from going into a tailspin when met with an ultra-high-cardinality query. Scary… yet awesome for one’s Game Day activities!
If the esteemed reader happens to know of a way to lock down ElasticSearch against this failure mode, we would love to learn the trick.
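For reference, the "go breadth-first" advice mentioned above maps to the `collect_mode` option of the Elasticsearch terms aggregation. The sketch below only builds a request body; the helper name, the aggregation names, and the field names `author` and `source` are hypothetical. It shapes a well-behaved nested aggregation and does nothing to stop a hostile query sent straight to the API.

```python
import json
from typing import Any, Dict


def nested_terms_body(outer_field: str, inner_field: str,
                      outer_size: int = 10, inner_size: int = 5) -> Dict[str, Any]:
    """Build a search body with a nested terms aggregation.

    The outer aggregation uses breadth_first collection, so the top outer
    buckets are selected before any inner buckets are computed.
    """
    if outer_size <= 0 or inner_size <= 0:
        raise ValueError("bucket sizes must be positive")
    return {
        "size": 0,  # return aggregation results only, no search hits
        "aggs": {
            "outer": {
                "terms": {
                    "field": outer_field,
                    "size": outer_size,
                    "collect_mode": "breadth_first",
                },
                "aggs": {
                    "inner": {
                        "terms": {"field": inner_field, "size": inner_size},
                    },
                },
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(nested_terms_body("author", "source"), indent=2))
```

Keeping the bucket sizes small and explicit is part of the point: in depth-first mode the number of candidate buckets grows with the product of the cardinalities at each level.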
What We Learned
We had a number of takeaways from the postmortem for improvements to our processes at Quid.
The big finding was, perhaps predictably, that everyone has different levels of exposure to internal tooling and debugging experience, and therefore you cannot assume that all your people know how to look up thing X to figure out condition Y. For example, in our case most engineers were skilled in only a subset of tools and systems we use and weren’t always aware of the right dashboards, logs or events used for diagnosis.
It takes practice! This understanding now informs our policy for setting up on-call pairs to have complementary knowledge, as well as our desire to run more Game Days in the future.
Other things we discussed as an engineering organization in the Game Day postmortem:
- Which alerts are useful and which are noise.
- Which metrics you aren’t recording that could have helped you catch an error.
- Which areas of your documentation are lacking.
- Which topics you really need to run organization-wide training on.
- Being on-call itself requires some training and a definition of roles, responsibilities, and standard operating procedures during incidents.
We also learned a few things about better organization of Game Day:
- Have an extra trick or two up your sleeve for creating failure conditions. One spectacular incident scenario, which had worked great for wreaking havoc in our test environment, had no effect whatsoever when we tried it during Game Day.
- Leave extra time for writing incident reports.
- Next time, consider running incidents individually for each participating team, so they can practice resolution without stepping on anyone else’s toes. This would take longer, though, so it might turn Game Day into a Game Week.
We plan on doing this again, and so should you! May the odds be ever in your favor.
Interested in helping us solve awesome problems? If so, then head over to our careers page!