Chaos Engineering and Security: Upgrading Simulation Exercises For More Dynamic Threat Environments
As the recent pandemic has swept the globe, malicious hackers have quickly pivoted to leverage the confusion to their benefit in carrying out cyberattacks. On April 8, 2020, the United States Department of Homeland Security (DHS) Cybersecurity and Infrastructure Security Agency (CISA) and the United Kingdom’s National Cyber Security Centre (NCSC) jointly released an alert providing “… information on exploitation by cybercriminal and advanced persistent threat (APT) groups of the current coronavirus disease 2019 (COVID-19) global pandemic.” The growth in remote working has increased the attack surface, potentially adding tens of millions of virtual private networks (VPNs) and home WiFi routers to the mix.
These clever playbooks are only the latest example of how quickly threat environments are evolving. APTs, organized criminal gangs and other threats are releasing and constantly evolving their tactics, techniques and procedures (TTPs) to stay one step ahead of information security defense teams.
Because the threat environment has become more dynamic, information security (infosec) simulations used to help security teams prepare for the possibility of an attack need to keep up with this high pace of change. While extensive, complex and engaging, these simulations historically have tended to be relatively static. They are typically rooted in actual attacks, but they have a limited repertoire. Infosec simulations need to be made more flexible and adaptable to specific industry verticals, company security postures and specific risk profiles.
Cyber range simulations and “war games” therefore need to become more customizable and easier to update. Humans cannot be expected to deal effectively with threats that are changing faster and faster unless the training to stop those threats evolves just as quickly to capture the latest TTPs, ideally almost as soon as they and their associated indicators of compromise (IOCs) are identified in the wild.
While immersive simulation in a physical cyber range is useful, these types of simulations must also capture the chaos of the real world and the unexpected failures or actions that often seem to characterize cyber incidents. Replicating the chaotic environments of the real world as a model for simulation is not unprecedented in the world of technology.
Ten years ago Netflix pioneered a new type of simulation methodology called chaos engineering. Beyond running simple QA checks or static code analysis, chaos engineering systematically breaks or overloads all the key parts of an application environment one by one to see what happens. More advanced exercises break multiple parts and follow specific sequences of induced failures to simulate cascading failures. In that sense, chaos engineering is actually not chaotic but rather systematic.
Chaos engineering closely controls the environment and limits the “blast radius.” Even so, it works well because it radically expands the flexibility of stress testing and helps teams identify many more potential points of failure in applications. This type of dynamic but controlled chaos is what has helped make the current generation of cloud native applications like Netflix and Amazon Web Services so resilient and generally reliable.
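The one-by-one and multi-failure experiments described above can be illustrated with a small Python sketch. This is only a toy model under stated assumptions, not how Netflix or any real chaos tool works: the service graph, the function names and the blast-radius threshold are all hypothetical, and a component is assumed to fail whenever anything it depends on fails.

```python
from itertools import combinations

# Hypothetical toy environment: each component maps to the components it
# cannot run without. "cache" is treated as optional, so nothing lists it.
DEPENDENCIES = {
    "web": ["api"],
    "api": ["db"],
    "db": [],
    "cache": [],
}


def cascade(broken):
    """Return the set of components that end up down after failures spread."""
    down = set(broken)
    changed = True
    while changed:
        changed = False
        for component, deps in DEPENDENCIES.items():
            if component not in down and any(dep in down for dep in deps):
                down.add(component)
                changed = True
    return down


def run_experiments(max_simultaneous=1, max_blast_radius=len(DEPENDENCIES)):
    """Break every combination of up to `max_simultaneous` components.

    Each result records which components went down and whether the outage
    stayed within the allowed blast radius (a count of affected components).
    """
    results = {}
    for size in range(1, max_simultaneous + 1):
        for broken in combinations(sorted(DEPENDENCIES), size):
            down = cascade(broken)
            results[broken] = {
                "down": sorted(down),
                "within_blast_radius": len(down) <= max_blast_radius,
            }
    return results
```

Calling `run_experiments()` breaks one component at a time, while `run_experiments(max_simultaneous=2)` also tries every pair, which is the systematic rather than random character the article describes.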
There are lessons from chaos engineering that could easily benefit information security simulation exercises and elevate the practice to deliver radically more value to immersive exercise participants on cyber ranges.
To name one improvement, if chaos engineering principles are to be applied, then simulation environments must be able to almost immediately import and run the latest attacks from new US-CERT alerts. For example, in the April 8 advisory referred to above, CISA lists in specific detail multiple ways bad actors were seeking to exploit the COVID-19 crisis. This added a new layer of chaos to the lives of security practitioners. It also demonstrates how we can add more chaos to simulations: by injecting more, and more current, situations based on real-world events and dynamics.
The available catalog of simulations that can run must be dramatically increased as well. This would mean that a red or blue team participating in exercises could literally run a different simulation and live response exercise each week and never do the same exercise twice. Even better would be adaptive simulations that can adjust TTPs in response to team behaviors. This would be particularly useful for attacks where we want to simulate how to handle horizontal penetration and proliferation once an environment has suffered a breach and a bad actor has implanted active agents inside the firewalls on compromised devices.
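One way to picture an adaptive simulation is a loop that chooses the next technique based on how the team has performed so far. The Python sketch below is a deliberately simple assumption about how that might work: the class, its methods and the technique names are hypothetical, and the selection rule (untried techniques first, then the ones the team detects least often) is just one possible policy.

```python
from dataclasses import dataclass, field


@dataclass
class AdaptiveExercise:
    """Hypothetical exercise driver that steers toward a team's weak spots."""

    techniques: list
    # technique name -> (times detected, times attempted)
    history: dict = field(default_factory=dict)

    def record(self, technique, detected):
        """Store whether the team detected one run of `technique`."""
        hits, tries = self.history.get(technique, (0, 0))
        self.history[technique] = (hits + int(detected), tries + 1)

    def detection_rate(self, technique):
        """Fraction of attempts the team detected (0.0 if never attempted)."""
        hits, tries = self.history.get(technique, (0, 0))
        return hits / tries if tries else 0.0

    def next_technique(self):
        """Pick the technique with the lowest detection rate.

        Ties are broken in favor of the technique attempted least often,
        then by catalog order.
        """
        return min(
            self.techniques,
            key=lambda t: (self.detection_rate(t), self.history.get(t, (0, 0))[1]),
        )


# Example run with made-up technique names and outcomes.
exercise = AdaptiveExercise(["phishing", "lateral_movement", "credential_dumping"])
exercise.record("phishing", detected=True)
exercise.record("lateral_movement", detected=False)
exercise.record("credential_dumping", detected=True)
```

After these three rounds, `exercise.next_technique()` returns `"lateral_movement"`, the one technique the team missed, so the following exercise would drill the lateral movement the paragraph above singles out.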
The number of documented attacks is already so large that this is not idle speculation. The SafeBreach Hacker’s Playbook contains more than 13,000 attack playbooks captured from real attack activities in the wild. All of these attacks are well described and are part of programmatic breach-and-attack simulations that run in automated fashion. The work that needs to be done is taking these attacks and creating immersive, human-centric simulations that allow teams to respond and react to attacks as they happen, something that you do not get in machine-versus-machine adversarial testing. This requires a different set of skills, more like those used to build video games, and it relies heavily on designing for human psychology.
Another way to make cyber range exercises more useful is to make the chaos more relevant. Attack types and simulations must mimic the real-world infrastructure, software environment and industry characteristics of participating teams. For example, a business-to-business (B2B) software company team may not benefit from drilling against a breach of PII and credit cards as much as a breach of key bank details. Likewise, a company that has key data stored in Amazon’s S3 storage service may not benefit as much from simulations replicating an attack against an Oracle database.
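Matching scenarios to a team's environment can be as simple as tagging each scenario with what it attacks and filtering against what the team actually runs. The following Python sketch assumes a hypothetical catalog format; the scenario names, tags and ranking rule are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    targets: frozenset  # technologies or data types the scenario attacks


def relevant_scenarios(catalog, team_assets):
    """Return scenarios that touch the team's assets, most overlap first."""
    assets = set(team_assets)
    scored = [(len(scenario.targets & assets), scenario) for scenario in catalog]
    ranked = sorted(scored, key=lambda pair: -pair[0])  # stable sort keeps catalog order on ties
    return [scenario for score, scenario in ranked if score > 0]


# Made-up catalog echoing the examples in the text.
catalog = [
    Scenario("oracle_db_breach", frozenset({"oracle"})),
    Scenario("s3_bucket_exposure", frozenset({"s3"})),
    Scenario("bank_detail_theft", frozenset({"bank_details", "s3"})),
]
selected = relevant_scenarios(catalog, {"s3", "bank_details"})
```

For a team whose key assets are S3 storage and bank details, `selected` contains the bank detail and S3 scenarios and omits the Oracle one.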
Lastly, the chaos should be systematic, programmatic and portable. Teams must be able to practice this training in groups and individually. When possible, they should compete against other internal teams or even teams from around the world on a regular basis to validate their skills and win virtual competitions. Playing simulation games would become a regular part of their job just like other duties.
To this end, companies should put in place structures and techniques to support the ongoing use of simulation to consistently test and drill against real-life threats and exploits. This is the future of effective simulation and cyber ranges. Chaos is now a global constant, in life and in cybersecurity. The best way to deal with chaos is to train against chaos, and have fun in the process.