Introduction

You must have heard about the hottest IT topic recently - the global system outage! 🤨 On July 19, 2024, CrowdStrike released a faulty configuration update that caused major system crashes (blue screens) on Windows systems worldwide. The impact was widespread, affecting multiple industries including aviation, banking, and media, resulting in severe business disruptions.
During this time, half of the IT folks were busy putting out fires and fixing systems, while the other half were making memes and jokes about it. 🤨 Of course, I couldn't resist joining in on the fun! It's so easy to just point fingers and say it's all CrowdStrike's fault. 🤨 Just like many affected IT departments who never take responsibility for anything. 😤😤 How convenient it is to have someone to blame!

Jokes aside, let me analyze this incident from a cloud architect's perspective and a system design point of view. There are definitely some interesting insights to be gained here, so make sure to read until the end! As a responsible cloud architect, it's important to review and examine our approach, and to identify any blind spots in our traditional design thinking. By learning from this incident, we can prevent similar issues in future designs.

Interestingly, even major international airlines couldn't escape this disaster. They surely have excellent system and cloud architects, so why did they still get burned? 🤨
Hi, I'm Alvin, an IT Cloud Technology Coach and Life Coach. I help fellow tech enthusiasts navigate their path in cloud technology. I also help myself build an ideal IT career that aligns with my interests, achieving work autonomy and time freedom.
Let me break down a cloud architect's thought process and design approach from an outsider's perspective.

Design 1: Basic System Design

The simplest system design only needs one server and one database in the backend to serve the frontend terminals.
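To make this concrete, here's a tiny Python sketch of Design 1 (a made-up toy model, not real infrastructure code - all names are invented): one server, one database, and no backup for either, so either one failing brings the whole service down.

```python
# Toy model of Design 1: one server, one database, no backups.
server = {"name": "server-1", "healthy": True}
database = {"name": "db-1", "healthy": True}

def service_is_up() -> bool:
    # The service works only if the single server AND the single database are both alive.
    return server["healthy"] and database["healthy"]

print(service_is_up())      # True

server["healthy"] = False   # the only server fails...
print(service_is_up())      # False - users see an outage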
Design 2: High Availability System Design

As architects, we worry about any server or database that might fail to operate properly. That's why we implement High Availability design. This means running two servers simultaneously with a Load Balancer in front to distribute the workload. When one server experiences issues, the other server takes over all the work. For databases, we have a backup ready. When the primary database has issues, the backup database is ready to go. It has the most up-to-date data from the primary database and steps in to continue operations.
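Sticking with the same kind of toy model (all the names below are hypothetical), High Availability roughly looks like this: a load balancer that only routes to healthy servers, plus a standby database that takes over when the primary fails.

```python
import random

# Toy model of Design 2: two servers behind a load balancer,
# plus a primary database with a standby replica.
servers = [
    {"name": "server-1", "healthy": True},
    {"name": "server-2", "healthy": True},
]
databases = {"primary": {"healthy": True}, "standby": {"healthy": True}}

def pick_server():
    """The 'load balancer': route a request to any healthy server."""
    healthy = [s for s in servers if s["healthy"]]
    return random.choice(healthy) if healthy else None

def pick_database():
    """Fail over to the standby when the primary is unhealthy."""
    if databases["primary"]["healthy"]:
        return "primary"
    if databases["standby"]["healthy"]:
        return "standby"
    return None

def service_is_up() -> bool:
    return pick_server() is not None and pick_database() is not None

servers[0]["healthy"] = False            # one server dies
databases["primary"]["healthy"] = False  # the primary database dies too
print(service_is_up())                   # True - the second server and the standby keep us alive
```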
Design 3: Multi-AZ System Design

As architects, we thought further - what if the Data Centre itself has issues? So, we distribute the architecture across different Availability Zones (AZ). In the cloud world, an AZ roughly means a Data Centre. For example, AWS has 3 AZs in Hong Kong and 3 in Singapore. In this architecture, suppose we place the system in two of Hong Kong's AZs. Even if the Hong Kong Island AZ has issues, the New Territories AZ can still maintain service. Business continuity has reached a very high level.
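In the same toy model, Multi-AZ just means tagging each copy of the stack with the data centre it lives in (the AZ names below are invented for illustration): losing an entire AZ still leaves healthy capacity in the other one.

```python
# Toy model of Design 3: the same stack duplicated across two Availability Zones.
servers = [
    {"name": "server-1", "az": "az-hk-island", "healthy": True},
    {"name": "server-2", "az": "az-new-territories", "healthy": True},
]

def healthy_servers():
    return [s for s in servers if s["healthy"]]

def fail_availability_zone(az: str):
    """Simulate an entire data centre going dark."""
    for s in servers:
        if s["az"] == az:
            s["healthy"] = False

fail_availability_zone("az-hk-island")
print([s["name"] for s in healthy_servers()])  # ['server-2'] - the other AZ keeps serving
```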
Design 4: Multi-Region System Design

Architects with too much time and budget on their hands will think even further. For large international enterprises that need 24/7 system operation, we cannot afford any risk of downtime. What if all AZs in Hong Kong fail? That would be disastrous! In this case, architects will design for Multi-Region. This means setting up the same system infrastructure in Singapore as we have in Hong Kong. When Hong Kong's infrastructure becomes inoperable, Singapore's systems will take over to maintain service. (This description skips many details, but let's move on for now.)
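Multi-Region is the same trick one level up. Here's a simplified sketch with made-up names (it skips real-world details like data replication and DNS failover): serve from the primary region while it has healthy capacity, otherwise fail over to the secondary region.

```python
# Toy model of Design 4: the whole stack duplicated in two regions.
regions = {
    "hong-kong": [
        {"name": "hk-server-1", "healthy": True},
        {"name": "hk-server-2", "healthy": True},
    ],
    "singapore": [
        {"name": "sg-server-1", "healthy": True},
        {"name": "sg-server-2", "healthy": True},
    ],
}

def pick_region():
    """Prefer Hong Kong; fail over to Singapore if Hong Kong has no healthy servers left."""
    for region in ("hong-kong", "singapore"):
        if any(s["healthy"] for s in regions[region]):
            return region
    return None

for s in regions["hong-kong"]:
    s["healthy"] = False            # every AZ in Hong Kong is gone

print(pick_region())                # 'singapore' - the other region takes over
```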
Design 5: Multi-planet System Design

You can imagine that this architecture can be infinitely scaled up. If budget is not a constraint, architects can add more Regions. And if architects are still not satisfied, they can extend the architecture to Mars. 🤨 When Earth explodes, Mars will take over, but... 🤨
... where are the users?? ... quiet as outer space ...
From the logic above, you can see the underlying thought process: it's all about avoiding a Single Point of Failure. (Hey, you've been talking for so long, when are you getting to the point? We've already gone to space! 🤨) We're getting there, we're getting there! The groundwork is complete.

From our previous logic, we take a macro approach with multiple backups to eliminate Single Points of Failure as much as possible. But unexpectedly, this time the problem was inside every server and endpoint computer. No matter how many backups we had, it wouldn't help!
I call this variant: Distributed Single Point of Failure. (Just having a little fun with the name 🤨)
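Here's why none of that redundancy helped this time, in the same toy model (a simplified simulation with invented names, not a description of how the actual update was delivered): when every server in every AZ and every region runs the same endpoint agent, one bad update pushed everywhere takes them all down at the same moment.

```python
# Toy model of a "Distributed Single Point of Failure":
# every server, in every region, runs the same endpoint agent.
regions = {
    "hong-kong": [{"name": f"hk-server-{i}", "agent_ok": True} for i in range(1, 3)],
    "singapore": [{"name": f"sg-server-{i}", "agent_ok": True} for i in range(1, 3)],
}

def push_agent_update(bad: bool):
    """The update reaches every machine at once - that's the whole point of such an agent."""
    for servers in regions.values():
        for s in servers:
            s["agent_ok"] = not bad

def healthy_anywhere() -> bool:
    return any(s["agent_ok"] for servers in regions.values() for s in servers)

push_agent_update(bad=True)
print(healthy_anywhere())   # False - no number of AZs or Regions saves us
```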
Conclusion

From a macro perspective, it's difficult to notice the possibility of this issue. However, when we look at it from a micro perspective, we discover something new. Learning from this lesson, how should we review our design next time?
Business Continuity is a crucial aspect of Information Security. Different methods suit different company situations and projects, and they cannot be applied universally. Do you have other thoughts? Let's continue the discussion!

Rather than blaming others, let's improve ourselves! Continuous Improvement is the key to success. Let's work hard together! That's all for today's sharing.
Q&A Time!

Q&A time starts now! Here are some common questions and my answers:
⭐️⭐️⭐️ Hey there, future cloud and IT rockstars! 🌟 Got questions? Drop them in the comments below 💭 - I'd love to share my journey and experiences with you. Remember, every expert was once a beginner. Keep pushing forward! 💪