The CrowdStrike and Microsoft Debacle
Could the colossal failure of CrowdStrike's Falcon platform, which took down over 8 million Windows machines and caused severe damage to computer systems around the world, have been prevented with basic software engineering practices: better design, good QA and staged rollouts?
Poor Design
The bug was triggered by an update to a threat configuration file in CrowdStrike's Falcon platform that caused the Falcon software to access invalid memory. The config file is updated often to make the platform aware of new security threats - sometimes multiple times a day.
Since the Falcon software runs at the OS level as a driver, the invalid memory access would crash the machine and force a reboot. Normally a bad driver would be ignored on reboot, but not in this case - the Falcon software was registered as a critical driver, so on each restart the OS tried to run it again, crashed again, and the cycle repeated.
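To make the failure mode concrete, here is a minimal sketch of the kind of defensive check that keeps a malformed content file from ever reaching the code that indexes into it. It's in Python for readability (the real agent is a C/C++ kernel driver), and the field counts and record layout are invented for illustration, not taken from CrowdStrike's actual format.

```python
# A minimal sketch, assuming the crash came from detection code indexing into a
# threat-definition record without first checking its size. Python is used for
# readability; the real agent is a C/C++ kernel driver. The field count and
# record layout are invented for illustration.

EXPECTED_FIELDS = 21        # illustrative: how many fields the logic expects
TARGET_FIELD_INDEX = 20     # illustrative: the field the detection logic reads

def detection_target(record_line: str) -> str | None:
    """Return the needed field, or None if the record is malformed."""
    fields = record_line.rstrip("\n").split("\t")
    if len(fields) != EXPECTED_FIELDS:
        # Without this check, fields[TARGET_FIELD_INDEX] on a short record is
        # the user-space analogue of the driver's invalid memory access.
        return None
    return fields[TARGET_FIELD_INDEX]

# A record with too few fields is rejected instead of blowing up.
assert detection_target("\t".join(["x"] * 20)) is None
assert detection_target("\t".join(["x"] * 21)) == "x"
```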
Preventing malware is important - but is it more important than keeping the system running at all?
A more robust design would allow the system to reboot safely without CrowdStrike in the mix - or at a minimum, give admins the ability to enable that behavior remotely as a configuration option.
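One way to picture that safer design is a boot-time circuit breaker: if a security component has crashed the last few boots in a row, the machine comes up without it and flags itself for remediation instead of looping forever. The sketch below is purely hypothetical - the counter file, threshold and load_falcon_driver() hook are invented, and nothing here reflects how Windows or Falcon actually behave - but it shows how little logic the fallback requires.

```python
# Hypothetical boot-time circuit breaker illustrating the "reboot safely without
# the driver" idea. The counter file, threshold and load_falcon_driver() hook are
# all invented; nothing here reflects how Windows or Falcon actually behave.
import json

CRASH_COUNTER_FILE = "falcon_crash_counter.json"   # persisted across reboots
MAX_CONSECUTIVE_CRASHES = 3

def read_crash_count() -> int:
    try:
        with open(CRASH_COUNTER_FILE) as fh:
            return json.load(fh).get("consecutive_crashes", 0)
    except (FileNotFoundError, json.JSONDecodeError):
        return 0

def write_crash_count(count: int) -> None:
    with open(CRASH_COUNTER_FILE, "w") as fh:
        json.dump({"consecutive_crashes": count}, fh)

def boot_security_driver(load_falcon_driver) -> bool:
    """Load the driver unless it has crashed the last few boots in a row."""
    if read_crash_count() >= MAX_CONSECUTIVE_CRASHES:
        # Come up without the driver and flag the machine for remote remediation.
        print("driver disabled after repeated boot crashes; booting degraded")
        return False
    write_crash_count(read_crash_count() + 1)   # assume this boot will crash...
    load_falcon_driver()                        # ...unless the load returns cleanly
    write_crash_count(0)                        # success: reset the counter
    return True

if __name__ == "__main__":
    boot_security_driver(lambda: print("falcon driver loaded"))
```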
Weaknesses in Microsoft's Windows OS design, which have been a problem for years, are also at the core of this failure. It is important to note that machines running Linux, Apple's macOS and other platforms were unaffected.
In this particular case the update was specific to the Microsoft OS. More generally, however, Unix-based operating systems have better designs in place that would protect them from the catastrophic doom loop that affected Microsoft Windows machines.
There seemed to be a lack of basic QA
Software bugs can be tricky, appearing only in rare edge cases, and those are genuinely difficult to catch and test. That was not the situation here: with over 8 million computers affected, the crash appeared to happen every time. A basic QA process should have caught it.
My guess is that because the change was a new threat configuration file, not necessarily new code, CrowdStrike engineers assumed it would be fine and the scenario was not tested adequately.
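Even a very simple pre-release gate would likely have flagged the problem: run every content file that is about to ship through the same record check the agent applies at runtime, and block the release on any failure. Below is a hypothetical pytest-style sketch; the directory, file extension and field count are all invented for illustration.

```python
# Hypothetical pre-release gate (pytest style): every content file about to ship
# is run through the same record check the agent applies at runtime, and any
# malformed record fails the build. The directory, extension and field count are
# invented for illustration.
import glob

import pytest

EXPECTED_FIELDS = 21   # must match whatever the detection engine assumes

@pytest.mark.parametrize("path", sorted(glob.glob("channel_files/*.txt")))
def test_content_file_records_are_well_formed(path):
    with open(path, encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            fields = line.rstrip("\n").split("\t")
            assert len(fields) == EXPECTED_FIELDS, (
                f"{path}:{lineno} has {len(fields)} fields, expected {EXPECTED_FIELDS}"
            )
```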
Enterprises used to block new updates from reaching production environments without significant testing first, but with the evolution of the internet, SaaS and the cloud, and with new security threats popping up every day, automatic updates have become commonplace. There needs to be a rebalancing of automatic updates versus proper testing.
A Staged Rollout Would Have Been a Better Strategy
When deploying critical systems, it's wise to release new code to a smaller, controlled environment first. This helps identify and fix bugs that could have catastrophic consequences before a wider rollout.
This is a lesson we learned early on at ICE. We adapted by rolling out major updates to certain smaller markets first, before pushing them to the systems that managed global oil trading, for example.
If there was a bug in the initial rollout, we could pause, fix it and try again, and the other markets were never affected.
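A staged rollout doesn't require anything exotic - conceptually it's a loop over deployment rings with a health check and a pause between each one. The sketch below is a simplified illustration; the ring names and the deploy_to()/healthy() hooks are placeholders for real deployment tooling and crash telemetry, not anything CrowdStrike or ICE actually uses.

```python
# Hypothetical staged-rollout loop: push an update ring by ring, wait for health
# signals, and pause instead of promoting on any regression. Ring names and the
# deploy_to()/healthy() hooks are placeholders for real deployment tooling and
# crash telemetry.
import time

RINGS = ["internal-canary", "small-market", "regional", "global"]

def deploy_to(ring: str, version: str) -> None:
    print(f"deploying {version} to ring '{ring}'")   # stand-in for a real deploy call

def healthy(ring: str) -> bool:
    return True                                      # stand-in for crash/boot telemetry

def staged_rollout(version: str, soak_minutes: float = 60) -> bool:
    for ring in RINGS:
        deploy_to(ring, version)
        time.sleep(soak_minutes * 60)                # let telemetry accumulate
        if not healthy(ring):
            print(f"rollout paused at '{ring}'; later rings never get {version}")
            return False
    return True

if __name__ == "__main__":
    staged_rollout("content-update-demo", soak_minutes=0)   # demo run, no soak time
```

The key property is that a failure in the first ring stops everything downstream, which is exactly what the pause-fix-retry loop described above provided.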
I’m curious what others think could have made a difference. Also, if your company was affected by the outage, are you considering alternatives or are you sticking with CrowdStrike?