No idea if it's already posted or not (I don't load embeds) but an interesting video from retired MS dev Dave Plummer on the outage:
Skip to 07:30 if you just want to see the explanation/speculation WRT crowdstrike.
For those of you, like me, for whom youtube is instant TLDW, here's a summary with most of the nerdiness preserved (and some hopefully coherent explanations from me for the non-nerdy who are still interested nonetheless):
- As with almost all security products, significant portions of crowdstrike run in kernel mode (the central nervous system of an operating system, if you will), both so it can see what everything else on the system is doing and so it can protect itself from malicious code running in user space that might try to disable it
- Crowdstrike's kernel driver is signed by Microsoft - basically a lengthy approvals process undertaken by MS themselves that says "yup, windows is allowed to load this as a kernel driver, we've checked it and it won't crash the system", authenticated with cryptographic signatures
- Crucially, the signed driver is also capable of loading unsigned code into itself via an update process - presumably so it can update portions of itself with code from the mothership without having to go through the approvals process again
- Key failure of the driver here seems to be it executing a new batch of this code, only to find it referencing a null pointer - basically an invalid address in memory. This is a bit akin to saying "deliver this parcel to 123 Acacia Avenue" when Acacia Avenue only goes up to 12. The driver can't proceed, so the kernel and thus the whole OS crashes.
- They've apparently looked at the file that this code was meant to be loaded from, and instead of being a genuine binary it appears to be just a file full of zeroes. The kernel driver apparently doesn't have any/enough input validation to detect when it's about to ingest and execute bogus code (or in this instance, no code at all), so it blindly slurps up the zeroes, tries to execute them, and fails. This is what's made it so horribly brittle (there's a rough sketch of what that missing validation might look like after this list).
- I lied a bit earlier in suggesting a crash of a kernel mode driver would always crash the OS. Greybeard gamers amongst you are probably familiar with modern windows being able to recover from a graphics card driver crash in a way that old windows didn't - your screen goes black and the game dies, but the OS stays up and if you're lucky there'll be a nice crash report you can forward to the manufacturer. So yes, windows these days does have the ability to recover from a crash of at least some kernel drivers.
- That didn't/couldn't happen here though, because the crowdstrike driver is flagged as "boot-critical" - basically a "you're NOTHING without ME!" flag. This is expected for security software - otherwise you could just find a way of crashing the driver somehow, then do your dirty work whilst it re-initialises itself. Crucially though, this means that if it crashes then the whole of windows goes with it via a BSOD (the second sketch after this list shows how to peek at that flag on a given driver).
- Crowdstrike presumably tested this update file somewhere before pushing it out to the wider world as an update... I find it hard to believe that this could possibly have passed even the most basic automated testing pipeline.
- If they did test it, presumably something went horribly wrong with the distribution mechanism, perhaps silently overwriting the file with zeroes somehow.
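To make the null-pointer/zeroes bullets a bit more concrete, here's a minimal user-space sketch of the kind of sanity checking a defensive parser would do before trusting a blob like that. To be clear, none of this is crowdstrike's actual code or file format - the header layout, the magic value and the names are all invented for illustration - but it shows why a file of all zeroes should be trivially rejectable: a zeroed header has no magic value and a zero offset, which is exactly the bogus "address" the driver apparently chased.

```
/* Minimal user-space sketch, NOT CrowdStrike's actual code or file format.
 * The header layout, magic value and names below are invented purely to
 * illustrate the checks described above. A channel file of all zeroes
 * fails the very first check, instead of handing the driver a zero
 * "address" to jump through. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct channel_header {       /* hypothetical layout */
    uint32_t magic;           /* file identifier                      */
    uint32_t code_offset;     /* where the payload starts in the file */
    uint32_t code_size;       /* how big the payload claims to be     */
};

#define CHANNEL_MAGIC 0xC5C5C5C5u   /* made-up value for the sketch */

/* Returns 0 only if the blob is structurally plausible. */
static int check_channel_file(const uint8_t *buf, size_t len)
{
    struct channel_header hdr;

    if (len < sizeof(hdr))
        return -1;                      /* too short to even hold a header   */
    memcpy(&hdr, buf, sizeof(hdr));     /* copy out to dodge alignment games */

    if (hdr.magic != CHANNEL_MAGIC)
        return -1;                      /* an all-zero file dies right here  */
    if (hdr.code_offset == 0 || hdr.code_offset >= len)
        return -1;                      /* the "123 Acacia Avenue" problem   */
    if (hdr.code_size == 0 || hdr.code_size > len - hdr.code_offset)
        return -1;                      /* payload runs off the end          */
    return 0;
}

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <channel-file>\n", argv[0]);
        return 1;
    }

    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }

    uint8_t buf[4096] = {0};
    size_t len = fread(buf, 1, sizeof(buf), f);
    fclose(f);

    if (check_channel_file(buf, len) != 0) {
        fprintf(stderr, "refusing invalid channel file\n");
        return 1;   /* a driver would log this and keep using the old data */
    }

    printf("channel file looks structurally plausible\n");
    return 0;
}
```

Obviously a real driver would do rather more than this (a signature or checksum over the payload, versioning, falling back to the previous known-good file), but even this level of checking would have turned "unbootable machine" into "update refused, carry on".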
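And for the "boot-critical" bullet: on Windows that property is essentially the driver service's start type (0 = boot-start). Here's a hedged sketch that reads it via the standard service config API - "CSAgent" is the service name commonly reported for the crowdstrike sensor driver, so treat that as an assumption and swap in whatever driver you fancy poking at.

```
/* Sketch: reading a driver service's start type via the standard Windows
 * service API. Start type 0 (SERVICE_BOOT_START) is the "boot-critical"
 * flag described above - crash one of those and there's no Windows left
 * to recover into. "CSAgent" is an assumption (the commonly reported
 * CrowdStrike sensor service name); substitute any driver you like.
 * Build with a Windows toolchain and link against Advapi32.lib. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    SC_HANDLE scm = OpenSCManagerA(NULL, NULL, SC_MANAGER_CONNECT);
    if (!scm) {
        printf("OpenSCManager failed: %lu\n", GetLastError());
        return 1;
    }

    SC_HANDLE svc = OpenServiceA(scm, "CSAgent", SERVICE_QUERY_CONFIG);
    if (!svc) {
        printf("OpenService failed: %lu\n", GetLastError());
        CloseServiceHandle(scm);
        return 1;
    }

    BYTE buf[8192];
    DWORD needed = 0;
    QUERY_SERVICE_CONFIGA *cfg = (QUERY_SERVICE_CONFIGA *)buf;
    if (QueryServiceConfigA(svc, cfg, sizeof(buf), &needed)) {
        printf("start type: %lu%s\n", (unsigned long)cfg->dwStartType,
               cfg->dwStartType == SERVICE_BOOT_START ? " (boot-start)" : "");
    } else {
        printf("QueryServiceConfig failed: %lu\n", GetLastError());
    }

    CloseServiceHandle(svc);
    CloseServiceHandle(scm);
    return 0;
}
```

If the driver really is registered as boot-start you should see start type 0 - the same information is what Windows uses to decide it can't carry on booting without it.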
There's failure on the part of both parties here IMHO: primarily on crowdstrike for allowing their update file to be, basically, invalid code at some point; but as part of the signing process MS review the driver and would have been able to see it was capable of loading arbitrary code from a file (from t'internet) and executing it in kernel mode without proper safeguards to check whether it's actually valid code. In my habitation of Paranoiaville, population me, that'd be an instant no as far as driver validation goes - returning to our Acacia Avenue analogy, it's like looking up the address on a map only to find out you didn't actually pick up a map, but instead a rabid mongoose wearing a Reform rosette.
Crucially, as I may have mentioned earlier, there doesn't seem to have been any way for admins to defer installation of such critical driver files - from the sounds of things they were treated as part of the regular antivirus definition files and thus were slurped up by machines as soon as they were published (software like this usually checks for updates in the background every hour or so). As soon as that empty driver file hits your disk you're stuck with an unbootable machine, and if the driver actually tries to load the code in realtime rather than waiting for a reboot... you get the instant BSOD as soon as the update lands. Explains why it managed to hit so many machines so quickly.
I've seen some suggestions that the earlier Azure outage may have meant people at MS didn't notice a null file going through, but I'm almost certain that if such a test existed it'd be automated, so I think there's a degree of asleep-at-the-wheel on both sides here.
More in my usual realm but outside my direct experience - apparently crowdstrike also caused crashes on a bunch of linux machines not so long ago with a dodgy kernel module in their software. They apparently offer two versions of the product - one using a kernel module/driver, and one using eBPF, which is meant to stop dodgy code like this taking the OS down with it - but from the look of this RedHat page that wasn't enough to prevent a full-on kernel panic (the linux equivalent of a BSOD) even with eBPF. There's a rather more accessible Debian bug report on the same issue. The saving grace here is that linux usually keeps at least one copy of your previous kernel+drivers kicking around, so you can boot an earlier kernel+drivers fairly trivially, which is why this didn't generate much in the way of news when it happened (combined with the relatively low prevalence of products like crowdstrike in the linux world).