Last Friday, as I was sipping my morning coffee, my mother, a seasoned salesperson, called with an urgent question: “Were you affected by CrowdStrike?” She was asking about a massive outage, now widely described as the largest IT outage in history. The incident disrupted countless businesses and reminded us how much we rely on technology.
What Happened?
CrowdStrike, a company providing security software, released an update for its Falcon Sensor, a program meant to protect computers from cyber threats. The update had a bug that caused many Windows computers to crash and display the “Blue Screen of Death” (BSOD), a severe error that stops the computer completely and shows a blue screen with error messages. Updates usually fix issues without such dramatic problems. This time, affected machines crashed during startup, so they could not be repaired remotely. Technicians had to boot each computer into Safe Mode or the Windows Recovery Environment and remove the faulty file by hand, causing major disruptions for businesses.
Personal Experience With CrowdStrike
Our history with CrowdStrike is a tale of missed opportunities and persistent challenges. Despite our extensive experience with the product, we faced significant hurdles in establishing a reseller relationship. Calls went unanswered for weeks, which cost us a major cybersecurity deal. Eventually, we were granted permission to resell, but not manage, the product, a far cry from our initial hopes. One night, I came home and told my wife that something felt off about this deal. Little did I know how significant that feeling would become.
The Issue and Its Fix
The issue was a seemingly simple update that went awry, causing the dreaded BSOD and rendering affected Windows systems inoperable. The fix required booting into Safe Mode or the Windows Recovery Environment and manually deleting a faulty file from the CrowdStrike driver folder, which is impractical for large-scale deployments like airport terminals. This incident serves as a reminder that no system is infallible, and that human error in the development process can have far-reaching consequences.
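To make the manual fix concrete, here is a minimal sketch of the deletion step. It assumes the file pattern `C-00000291*.sys` and the folder `C:\Windows\System32\drivers\CrowdStrike` from the publicly circulated workaround. The function name and the `dry_run` flag are my own illustration, not a CrowdStrike tool. In practice the step was done by hand or with a short command from Safe Mode, where Python is usually not available.

```python
from pathlib import Path

# Assumption: file pattern taken from the publicly circulated workaround.
FAULTY_PATTERN = "C-00000291*.sys"


def remove_faulty_channel_files(driver_dir: Path, dry_run: bool = True) -> list[Path]:
    """Return files in driver_dir matching the faulty pattern.

    With dry_run=True (the default) nothing is deleted; the matches are
    only reported. With dry_run=False each matching file is removed.
    """
    if not driver_dir.is_dir():
        return []
    matches = sorted(p for p in driver_dir.glob(FAULTY_PATTERN) if p.is_file())
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches
```

Calling `remove_faulty_channel_files(Path(r"C:\Windows\System32\drivers\CrowdStrike"))` would only list the matching files. Passing `dry_run=False` deletes them, after which the machine is rebooted normally. The step itself is trivial. The hard part was that someone had to perform it on every affected machine.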
Reflection and Preparation
Reflecting on this event, I recall reaching out to my rabbi to ask whether I should say the Gomel blessing, typically recited after surviving a life-threatening situation. It was not appropriate here, but the sentiment was real: had we gone forward with CrowdStrike, our summer rollouts would have been catastrophic. Ironically, around the same time, I had discussed a hypothetical situation with my team lead, Tzvi Feygin: “What if all our operating systems got corrupted? What’s our plan?” Tzvi had already developed a cloud-based solution that lets us reinstall Windows remotely, which reassured us that we were ahead of the curve.
Blame and Responsibility
People often ask, “Why blame Microsoft for this?” Imagine ordering a $400 steak at a five-star steakhouse and having it come out burned. The chef and the sous chef both missed the mistake, but in the end the waiter gets blamed. Similarly, CrowdStrike and Microsoft both had roles in this failure, yet customers unjustly hold IT providers like us accountable. Customers are unforgiving. I have had clients read about ransomware attacks and then question our foresight and solutions. Every impacted business would have blamed us and questioned our preparedness.
Conclusion
It’s unfortunate that Microsoft has taken the hit here. Microsoft allows third-party developers to integrate deeply with its operating system after rigorous testing. This is why platforms such as Apple’s and Android limit the number of approved vendors to prevent potential OS corruption. However, this also means missing out on some excellent software. The key takeaway is to test new software rigorously before full deployment. We practice this diligently, testing thoroughly on individual computers before any wider rollout. Mistakes happen, but being prepared is crucial.
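As a rough illustration of testing on individual computers before a wider rollout, here is a minimal sketch of a staged deployment. The ring names, host names, and the `deploy` and `is_healthy` callbacks are all hypothetical placeholders, not our actual tooling. The idea is simply that an update reaches a small group first and stops spreading as soon as a machine in that group fails its health check.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Ring:
    """A group of machines that receives an update at the same stage."""
    name: str
    hosts: list[str]


def staged_rollout(
    rings: list[Ring],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> list[str]:
    """Deploy ring by ring and halt at the first ring with an unhealthy host.

    Returns the names of the rings that completed successfully.
    """
    completed: list[str] = []
    for ring in rings:
        for host in ring.hosts:
            deploy(host)
        # Check the whole ring before touching the next, larger one.
        if not all(is_healthy(host) for host in ring.hosts):
            break
        completed.append(ring.name)
    return completed
```

With rings such as a single pilot PC, then one office, then everyone, a bad update would stop at the pilot machine instead of reaching every computer at once.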
On-site presence remains essential, and while we manage everything via the cloud, we are always ready for any eventuality.
The moral of this story is to prepare for mistakes and to manage them effectively when they occur, with robust contingency plans and resilient IT infrastructure.
Shneur Garb is the CEO of The Garb Cloud Consulting Group LLC