
Massive worldwide IT outage, hitting banks, airlines, supermarkets, broadcasters, etc. [19th July 2024]

No idea if it's already posted or not (I don't load embeds) but an interesting video from retired MS dev Dave Miller on the outage:


Skip to 07:30 if you just want to see the explanation/speculation WRT crowdstrike.

For those of you, like me, for whom youtube is instant TLDW, here's a summary with most of the nerdiness preserved (and some hopefully coherent explanations from me for the non-nerdy who are still interested nonetheless):
  • As with almost all security products, significant portions of crowdstrike run in kernel mode (the central nervous system of an operating system if you will), both to be able to see what anything else on the system is doing and to also protect themselves from malicious code running in user space that might try to disable the security products
  • Crowdstrike's kernel driver is signed by Microsoft - basically a lengthy approvals process undertaken by MS themselves that says "yup, windows is allowed to load this as a kernel driver, we've checked it and it won't crash the system", authenticated with cryptographic signatures
  • Crucially, the signed driver is also capable of loading unsigned code into itself via an update process - presumably so it can update portions of itself with code from the mothership without having to go through the approvals process again
  • Key failure of the driver here seems to be it executing a new batch of this code, only to find it referencing a null pointer - basically an invalid address in memory. This is a bit akin to saying "deliver this parcel to 123 Acacia Avenue", when Acacia Avenue only goes up to 12. The driver can't proceed, the kernel and thus the whole OS crashes.
  • They've apparently looked at the file that this code was meant to be loaded from, and instead of being a genuine binary it appears to be just a file full of zeroes. The kernel driver apparently doesn't have any/enough input validation to detect when it's about to ingest and execute bogus code (or in this instance, no code at all) so it blindly slurps up the zeroes, tries to execute them, and fails (see the sketch after this list). This is what's made it so horribly brittle.
  • I lied a bit earlier in suggesting a crash of a kernel mode driver would always crash the OS. Greybeard gamers amongst you are probably familiar with modern windows being able to recover from a graphics card driver crash in a way that old windows didn't - your screen goes black and the game dies, but the OS stays up and if you're lucky there'll be a nice crash report you can forward to the manufacturer. So yes, windows these days does have the ability to recover from a crash of at least some kernel drivers.
  • That didn't/couldn't happen here though because the crowdstrike driver is flagged as "boot-critical", basically a "you're NOTHING without ME!". This is expected for security software - else you could just find a way of crashing the driver somehow, then do your dirty work whilst the driver is re-initialising itself. Crucially though this means if it crashes then the whole of windows goes with it via a BSOD.
  • Crowdstrike presumably tested this update file somewhere before pushing it out to the wider world as an update... I find it hard to believe that this could possibly have passed even the most basic automated testing pipeline.
  • If they did test it, presumably something went horribly wrong with the distribution mechanism, perhaps silently overwriting the file with zeroes somehow.
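
As a rough illustration of that parsing failure (my own sketch in C, not CrowdStrike's actual code or file format - the header layout and magic number are invented), here's the kind of blind parse that falls over when the content file turns out to be all zeroes, and the cheap check that would have refused it:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical layout for a downloaded "content" file. */
struct content_header {
    uint32_t magic;        /* expected identifier, 0xC0DEFEEDu in this sketch */
    uint32_t entry_offset; /* where the actual detection rules start          */
};

/* Returns a pointer to the rules, or NULL if the blob is obviously bogus. */
static const uint8_t *load_rules(const uint8_t *blob, size_t len)
{
    const struct content_header *hdr = (const struct content_header *)blob;

    /* A zero-filled file gives magic == 0 and entry_offset == 0. Without
     * this check the caller would wander off into an invalid address -
     * merely fatal to one process in user space, a BSOD inside a kernel
     * driver. */
    if (len < sizeof(*hdr) || hdr->magic != 0xC0DEFEEDu)
        return NULL;

    return blob + hdr->entry_offset;
}

int main(void)
{
    uint8_t zeroes[4096];
    memset(zeroes, 0, sizeof(zeroes));   /* the "file full of zeroes" */

    const uint8_t *rules = load_rules(zeroes, sizeof(zeroes));
    printf(rules ? "loaded rules\n" : "rejected invalid content file\n");
    return 0;
}
```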

There's a failure on the part of both parties here IMHO; primarily on crowdstrike for allowing their update file to basically be invalid code at some point, but in order to sign the driver MS get the source code and would have been able to see it was capable of loading arbitrary code from a file (from t'internet) and executing it in kernel mode without proper safeguards to check whether it's actually valid code. In my habitation of Paranoiaville, population me, that'd be an instant no as far as driver validation goes - returning to our Acacia Avenue analogy, it's like looking up the address on a map only to find out you didn't actually pick up a map, but instead a rabid mongoose wearing a Reform rosette.
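
For the "proper safeguards" bit, I'd have expected something like this at a minimum before the driver touches anything it downloaded. Hedged sketch only: the function names are mine, and the toy checksum stands in for a real cryptographic digest plus vendor signature check.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Toy stand-in for a real digest/signature verification (e.g. SHA-256 plus
 * a signature from the vendor); enough to show the idea. */
static uint32_t toy_digest(const uint8_t *buf, size_t len)
{
    uint32_t sum = 2166136261u;              /* FNV-1a style accumulator */
    for (size_t i = 0; i < len; i++)
        sum = (sum ^ buf[i]) * 16777619u;
    return sum;
}

/* Accept the content file only if it matches the digest published with the
 * update. A file silently replaced by zeroes would fail here and simply be
 * ignored, instead of being executed in kernel mode. */
static bool content_file_is_valid(const uint8_t *blob, size_t len,
                                  uint32_t expected_digest)
{
    return blob != NULL && len > 0 && toy_digest(blob, len) == expected_digest;
}

int main(void)
{
    uint8_t zeroes[4096] = {0};
    uint32_t published = 0xDEADBEEFu;        /* whatever shipped with the update */

    printf("%s\n", content_file_is_valid(zeroes, sizeof(zeroes), published)
                       ? "ok to load"
                       : "refuse to load, keep the old rules");
    return 0;
}
```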

Crucially, as I may have mentioned earlier, there doesn't seem to have been any way for admins to defer installation of such critical driver files - from the sounds of things they were treated as part of the regular antivirus definition files and thus were slurped up by machines as soon as they were available (software like this usually checks for updates in the background every hour or so). As soon as that empty driver file hits your disk you're stuck with an unbootable machine, and if the driver actually tries to load the code in real time instead of waiting for a reboot... you get an instant BSOD as soon as the update executes. That explains why it managed to hit so many machines so quickly.
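
To make the "slurped up as soon as it appears" point concrete, here's a hypothetical sketch in C of that kind of hourly polling loop with no staging or deferral step. The function names and file name are mine, not CrowdStrike's real agent; it just shows why a bad file reaches everyone at roughly the same time.

```c
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/* Stand-ins for the real download and driver-notification paths. */
static bool fetch_channel_file(const char *path) { (void)path; return true; }
static bool hand_to_driver(const char *path)
{
    printf("driver loads %s immediately\n", path);
    return true;
}

int main(void)
{
    for (;;) {
        if (fetch_channel_file("channel-file.bin")) {
            /* No canary ring, no admin-controlled delay, no "wait for
             * reboot": every machine that polls gets the new file applied
             * straight away, which is why a bad one hit so many hosts so
             * quickly. */
            hand_to_driver("channel-file.bin");
        }
        sleep(60 * 60); /* poll roughly hourly, as described above */
    }
}
```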

I've seen some suggestions that the earlier Azure outage may have meant people at MS didn't notice a null file going through, but I'm almost certain that if such a test existed it'd be automated, so I think there's a degree of asleep-at-the-wheel on both sides here.

More in my usual realm but outside my direct experience - apparently crowdstrike also caused crashes on a bunch of linux machines not so long ago with a dodgy kernel module in their software (apparently they offer two versions of the product - one using a kernel module/driver, and one using eBPF which, as mentioned earlier, helps stop dodgy code from crashing the OS), but from the look of this RedHat page that wasn't enough to prevent a full-on kernel panic (the linux equivalent of a BSOD) even with eBPF. There's a rather more accessible Debian bug report on the same issue. The saving grace here is that linux usually keeps at least one copy of your previous kernel+drivers kicking around, so you can boot back to an earlier version fairly trivially, which is why this didn't generate much in the way of news when it happened (combined with the relatively low prevalence of products like crowdstrike in the linux world).
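
For the curious, this is roughly what the eBPF flavour looks like: a minimal kernel-side program in restricted C using libbpf-style conventions (my own example, nothing to do with CrowdStrike's sensor). The point is that the kernel's verifier has to prove it safe - bounded, no wild pointer access - before it will load it at all, so a bogus program is normally rejected rather than crashing the box, although, as that RedHat page shows, bugs in the kernel's own BPF machinery can still cause a panic.

```c
/* build (roughly): clang -O2 -g -target bpf -c probe.c -o probe.o */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("tracepoint/syscalls/sys_enter_openat")
int watch_openat(void *ctx)
{
    /* Just log that an openat() happened; a real sensor would inspect it.
     * Output appears in /sys/kernel/debug/tracing/trace_pipe. */
    bpf_printk("openat observed");
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```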
 
That's the best description I've seen, ta :)

And as for the video summary - great idea; I generally much prefer reading stuff to listening. eBPF is apparently sandboxed, which makes it safer. :thumbs:
 
I'm impressed how quickly real people were able to get around so many real boxes and fix them. I appreciate the fix itself doesn't sound overcomplicated but nonetheless
 
As you say the fix doesn't sound too complicated - I'm a bit surprised that more people didn't do the fix themselves rather than have IT come round. Letting them know what to do might be a bit more difficult though, a different form of bootstrapping unless the companies also have personal emails?
 
As you say the fix doesn't sound too complicated - I'm a bit surprised that more people didn't do the fix themselves rather than have IT come round. Letting them know what to do might be a bit more difficult though, a different form of bootstrapping unless the companies also have personal emails?
Because in a corporate environment, it is often not possible to even interrupt the boot process and/or boot from a different device without passwords and privileged access.
 
which would have to be notified by phone or email? I did read that it could be a password that would only be valid for a few minutes after regaining access but I suppose even that would be enough for a hacker to take over an account if they were faster than the employee.
 
which would have to be notified by phone or email? I did read that it could be a password that would only be valid for a few minutes after regaining access but I suppose even that would be enough for a hacker to take over an account if they were faster than the employee.
Yes, it could be, although I imagine they'd want to do a later round of password changes to restore security afterwards.

I think you might be overestimating the average office worker's ability to perform the (comparatively simple) steps, once in, though. Have you done much first-line tech support? :D
 
Key failure of the driver here seems to be it executing a new batch of this code, only to find it referencing a null pointer - basically an invalid address in memory. This is a bit akin to saying "deliver this parcel to 123 Acacia Avenue", when Acacia Avenue only goes up to 12. The driver can't proceed, the kernel and thus the whole OS crashes.
To add some detail here, the kernel driver (CS) attempts to access an illegal memory location. The kernel says "Wait, I'm the kernel. If I don't know what memory is safe then nothing is safe." If the kernel continues to run at this point, data corruption is almost assured. So it crashes - as it should. The crash is to protect your data. This is why 3rd party kernel drivers are such a bastard, but there are legitimate times you'd really want things in there - either for security, or for something that needs every drop of performance (GPU drivers are kernel mode). As reported elsewhere, they did put out an update that crashed Linux systems as well a few months ago, but most installs have moved on to a system where the kernel gets replicated into a container and the Crowdstrike agent works against that instead. The upside is that it's difficult (though not outright impossible) to crash the kernel that way, but the tradeoff is that CS pretty much has to halt the system if it finds something, whereas it can actively remove it if it's running in the real-time kernel.
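
If you want to see that distinction for yourself, here's a trivial user-space demo (plain C, nothing to do with CS): the same null dereference that blue-screens a boot-critical kernel driver just gets the one offending process killed when it happens outside the kernel.

```c
#include <signal.h>
#include <stddef.h>
#include <unistd.h>

static void on_segv(int sig)
{
    (void)sig;
    /* The OS contained the damage to this one process; everything else on
     * the machine carries on. The kernel has no equivalent safety net for
     * its own code, hence the BSOD. */
    static const char msg[] = "SIGSEGV caught: only this process dies\n";
    write(STDERR_FILENO, msg, sizeof(msg) - 1);
    _exit(1);
}

int main(void)
{
    signal(SIGSEGV, on_segv);

    volatile int *bogus = NULL;   /* user-space analogue of the bad pointer  */
    return *bogus;                /* faults here; in kernel mode it's game over */
}
```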
 
At the end of the day, it's like pretty much all of these types of events. It boils down to unchecked human error, and there are no excuses for it, apart from laziness and greed, with the scales usually tipping to the latter.
I have no doubt that Crowdstrike have checks and procedures in place that would/should have prevented this from happening, had they been followed, but they'll cost the company, so they'll probably have been bypassed for a healthier bottom line.
The blame will likely fall squarely at the door of a programmer, or a faulty system that can't argue back, as they'll need a scapegoat, and although it was obviously programmer error, critical code like this should never have been released into the wild without undergoing rigorous testing.
 
I did a brief stint as a consultant after 20 years on the other side. I was offered a gig at the company who handle card payments across the Danish retail ecosystem.
They were not PCI compliant.
They had 3 days to become PCI compliant.
They had been certified as PCI compliant by another auditor.
I took one look at the task and told them to contact their lawyer.
 
Similarly when I was asked to look into US export restrictions: a set of impossible tasks that even the US military couldn't achieve.
I worked on projects to develop operating systems compliant with those requirements. Again, a monumental waste of time.
 
At the end of the day, it's like pretty much all of these types of events. It boils down to unchecked human error, and there are no excuses for it, apart from laziness and greed, with the scales usually tipping to the latter.
I have no doubt that Crowdstrike have checks and procedures in place that would/should have prevented this from happening, had they been followed, but they'll cost the company, so they'll probably have been bypassed for a healthier bottom line.
The blame will likely fall squarely at the door of a programmer, or a faulty system that can't argue back, as they'll need a scapegoat, and although it was obviously programmer error, critical code like this should never have been released into the wild without undergoing rigorous testing.
Look into the internet chatter about working conditions there.
This wasn’t human error. This was human fuck you
 
The Register, an IT newsletter, has a reasonably readable analysis of how it went wrong that's worth a read here. And an article based on information from CrowdStrike about how the release made it into the wild here
Interesting, ta, and:

CrowdStrike has blamed a bug in its own test software for the mass-crash-event it caused last week.
I have the same problem with software that a mate remarked on, ooh, 45 years ago with airplanes: "it's not that they work so well that confuses me, it's that they work at all"
 
Look into the internet chatter about working conditions there.
This wasn’t human error. This was human fuck you
I haven't been following it but if that's the case, I hope it was deliberate.
The Register, an IT newsletter, has a reasonably readable analysis of how it went wrong that's worth a read here. And an article based on information from CrowdStrike about how the release made it into the wild here
A bug in the test software :D Did the test software forget to install the code on an actual machine in a test environment? Or smoke test it before release?
Playing the blame game isn't how you do good business. Taking responsibility for your fuck up should be priority one. They don't seem to have done that. Have they even apologised?
 
teuchter .

What’s been described as the largest IT outage in history will cost Fortune 500 companies alone more than $5 billion in direct losses, according to one insurer’s analysis of the incident published Wednesday.

 
teuchter .

What’s been described as the largest IT outage in history will cost Fortune 500 companies alone more than $5 billion in direct losses, according to one insurer’s analysis of the incident published Wednesday.


If an event doesn’t personally affect him it doesn’t happen

He's the forest that's too dense to hear a tree fall
 
teuchter .

What’s been described as the largest IT outage in history will cost Fortune 500 companies alone more than $5 billion in direct losses, according to one insurer’s analysis of the incident published Wednesday.


It's time to stop with this silliness. So what if it's the "largest IT outage in history"?

What I have challenged is the idea that it caused "absolute chaos across the world".

Did it cause "absolute chaos across the world"? No, it did not. It caused moderately widespread disruption to certain services for a short period of time. And posting up more poor-quality news articles is not going to change this.

How it affected me is entirely irrelevant. If it had caused my house to burn down, and all of my fingers to fall off, would that equal "absolute chaos across the world"? No it would not.
 