CrowdStrike update chaos explained: What you need to know

On Friday 19 July 2024, the UK awoke to news of a fast-spreading IT outage, seemingly global in its nature, affecting hundreds – if not thousands – of organisations.

The disruption began in the early hours of Friday morning in Australia, before spreading quickly across Asia, Europe and the Americas, with the travel industry among the most widely affected.

The outage was quickly tracked to cyber security firm CrowdStrike, which is already engaged in incident response amid the chaos. Keep on top of this developing incident over the coming days and weeks with our Essential Guide.

What does CrowdStrike do?

CrowdStrike is one of the world’s most prominent cyber security companies, with thousands of customers all over the world. Based in Texas, it employs more than 8,000 people and books about $3bn in revenues per annum. It has been around since 2011.

The organisation bills itself thus: “CrowdStrike has redefined security with the world’s most advanced cloud-native platform that protects and enables the people, processes and technologies that drive modern enterprise. CrowdStrike secures the most critical areas of risk – endpoints and cloud workloads, identity, and data – to keep customers ahead of today’s adversaries and stop breaches.”

CrowdStrike will be unfamiliar to most people not steeped in the technology industry, although Formula 1 fans will be aware of it thanks to its headline sponsorship of the Mercedes AMG Petronas team – its branding appears on the halo safety device and is clearly seen on onboard footage from Lewis Hamilton’s car.

Security practitioners will know CrowdStrike from its frequent contributions to major incident investigations, including the Sony Pictures hack, the WannaCry crisis, and the 2016 hack of the Democratic National Committee by Russia.

What happened during the CrowdStrike outage?

The disruption at first manifested in the form of the infamous blue screen of death – which signals a fatal system error – on Windows PCs.

Given the disruption appeared to be a Microsoft problem to begin with, it was Redmond that first responded, confirming just before 8am BST that it was investigating problems affecting cloud services in the US.

It quickly became apparent that the issue was not down to Microsoft itself, but rather a faulty channel file rolled out to CrowdStrike’s Falcon sensor product.

Falcon is a solution designed to prevent cyber attacks by unifying next-gen antivirus, endpoint detection and response (EDR), threat intelligence and threat hunting, and security hygiene. This is all delivered and managed through a lightweight, cloud-managed sensor, which seems to be where the issue arose.

The botched roll-out effectively caused what is known as a boot loop. This occurs when a Windows device crashes and restarts during its startup process, over and over again, meaning the machine cannot complete a stable boot cycle and, therefore, never reaches a usable state.

At the time of writing, the facts of the incident have not been fully established, and an investigation will likely take some time.

However, such issues generally occur either because of inadequate testing across various desktop and server environments, or because of a lack of proper sandboxing and rollback mechanisms for updates that interact with the kernel.

Is there a cyber security threat from the CrowdStrike outage?

Though similar in its effect and origins to a supply chain attack, it is important to note that the CrowdStrike outage is not a cyber security incident and nobody is known to be under attack as a result of it.

However, as it affects a cyber security product there is a chance that threat actors may seek to take advantage of the downtime caused and any gaps in coverage arising.

Almost certainly, the coming days and weeks will see threat actors exploiting the incident in phishing and social engineering attacks as they attempt to lure new victims. Potential lures could include offers of technical support or bogus CrowdStrike updates, and the consequences could include data exfiltration, ransomware deployment and extortion.

Security and IT leaders and admins would be well-advised to communicate the potential follow-on dangers to their users.

Who was affected by the CrowdStrike outage?

The full number of organisations affected by the outage is not yet known. However, those that are known to have experienced some impact, or have confirmed as much themselves, include:

Airlines including American Airlines, Delta, KLM, Lufthansa, Ryanair, SAS and United;
Airports including Gatwick, Luton, Stansted and Schiphol;
Financial organisations including the London Stock Exchange, Lloyds Bank and Visa;
Healthcare including most GP surgeries and many independent pharmacies;
Media organisations including MTV, VH1, Sky and some BBC channels;
Retailers, leisure and hospitality organisations including Gail’s Bakery, Ladbrokes, Morrisons, Tesco and Sainsbury’s;
Sporting bodies including F1 teams Aston Martin Aramco, Mercedes AMG Petronas and Williams Racing, all competing on the weekend of 20 and 21 July at the Hungarian Grand Prix, and the Paris 2024 Organising Committee for the Olympic and Paralympic Games, which begin on 26 July;
Train operating companies (TOCs) such as Avanti West Coast, Merseyrail, Southern and Transport for Wales.

What is CrowdStrike saying about the outage?

In an initial statement, CrowdStrike CEO George Kurtz said: “CrowdStrike is actively working with customers impacted by a defect found in a single content update for Windows hosts. Mac and Linux hosts are not impacted. This is not a security incident or cyber attack.

“The issue has been identified, isolated and a fix has been deployed. We refer customers to the support portal for the latest updates and will continue to provide complete and continuous updates on our website.

“We further recommend organisations ensure they’re communicating with CrowdStrike representatives through official channels. Our team is fully mobilised to ensure the security and stability of CrowdStrike customers.”

In a breakfast TV interview with NBC in the US, Kurtz added: “We’re deeply sorry for the impact that we’ve caused to customers, to travellers, to anyone affected by this, including our companies.”

Microsoft’s full statement, shared with the BBC and attributed to a spokesperson, reads: “We’re aware of an issue affecting Windows devices due to an update from a third-party software platform. We anticipate a resolution is forthcoming.”

Can I fix the CrowdStrike problem myself?

CrowdStrike has rolled back the changes to the affected product automatically, but hosts may continue to crash or be unable to stay online to receive the remedial update.

The short answer to the question is yes, but unfortunately, such issues can be daunting to fix, requiring IT teams to put in a lot of work. It may be days, or even longer, before all the affected devices can be reached.

System administrators are advised to take the following steps:

Boot Windows into safe mode, or the Windows Recovery Environment;
Navigate to the C:\Windows\System32\drivers\CrowdStrike directory;
Locate the file matching “C-00000291*.sys”. Delete this file;
Boot normally.
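
For illustration only, the file-matching and deletion steps above can be sketched in Python. This is not a CrowdStrike tool. It assumes a Python interpreter can reach the affected drive, which will often not be the case in safe mode or the Windows Recovery Environment, where the file is normally deleted by hand. The function name and the dry_run parameter are hypothetical.

```python
from pathlib import Path

# Pattern quoted in the remediation steps above.
CHANNEL_FILE_PATTERN = "C-00000291*.sys"


def remove_faulty_channel_files(driver_dir: Path, dry_run: bool = True) -> list[Path]:
    """Find files in driver_dir matching the faulty channel file pattern.

    With dry_run=True (the default) nothing is deleted and the matches are
    only reported. With dry_run=False each match is deleted. The list of
    matching paths is returned in both cases.
    """
    if not driver_dir.is_dir():
        # Nothing to do if the directory is missing or not reachable.
        return []

    matches = sorted(driver_dir.glob(CHANNEL_FILE_PATTERN))
    for path in matches:
        if not dry_run:
            path.unlink()
    return matches
```

Calling the function with the CrowdStrike driver directory from the steps above and dry_run left at its default lists what would be removed; passing dry_run=False performs the deletion. Defaulting to a dry run is a deliberate choice in this sketch, since deleting the wrong file from a drivers directory can cause problems of its own.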

CrowdStrike customers can access more information by logging into its support portal.

How can I avoid similar problems in the future?

Security firms such as CrowdStrike are under a great deal of pressure when it comes to product development and updates, which must be done frequently as they strive to keep their customers protected from new zero-days, ransomware and the like.

This pressure also trickles down to customers themselves, who will understandably often want to take advantage of settings to allow their security tools to update automatically.

To avoid falling victim to this kind of problem going forward, IT teams should consider taking a phased approach to software updates – particularly if they pertain to security solutions – and test them in a sandbox environment, or on a limited set of devices, prior to full deployment.
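
As a rough sketch of what a phased approach can look like in practice, the Python below pushes an update to one group of devices (a "ring") at a time and halts as soon as any device in a ring fails a health check. Everything here is an assumption made for illustration: the ring layout, the apply_update and is_healthy callbacks, and the device names stand in for whatever deployment and monitoring tooling an organisation actually uses.

```python
from typing import Callable, Sequence


def phased_rollout(
    rings: Sequence[Sequence[str]],
    apply_update: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> tuple[list[str], bool]:
    """Roll an update out ring by ring, stopping at the first unhealthy ring.

    Returns (updated_devices, completed). completed is False if the rollout
    was halted before reaching every ring.
    """
    updated: list[str] = []
    for ring in rings:
        # Update every device in the current ring.
        for device in ring:
            apply_update(device)
            updated.append(device)
        # Check the whole ring before moving on to a wider audience.
        if not all(is_healthy(device) for device in ring):
            return updated, False
    return updated, True


# Hypothetical usage: a sandbox machine, a small pilot group, then the fleet.
rings = [["sandbox-vm"], ["pilot-1", "pilot-2"], ["fleet-1", "fleet-2", "fleet-3"]]
broken = {"pilot-2"}  # pretend this device fails to boot after the update

updated, completed = phased_rollout(
    rings,
    apply_update=lambda device: None,  # stand-in for a real deployment call
    is_healthy=lambda device: device not in broken,
)
print(updated, completed)
```

In this example the fault surfaces in the pilot ring, so the three fleet devices never receive the update. A real rollout would also need a waiting period between rings and a way to roll back the devices already updated, neither of which is modelled here.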

It is also wise to have some level of system redundancy built in to properly isolate and manage fault domains, particularly when running critical infrastructure.