Friday, July 13, 2012

The Shaw Fire And Why It Matters

At 7:12am MDT on July 13, 2012, Alberta Treasury Branch tweeted that their online systems were officially back up, approximately 40 hours after the fire in Shaw Court took out the primary servers in the data center housing their systems and water from a sprinkler system took out their backup servers sitting in the same location.

I’ve been on a bit of a rant lately about the thoroughness of IT architecture, and this unfortunate incident makes me angry.

I know there’s a lot of debate going on around why sprinklers were in the data center and why a non-water-based fire suppression system wouldn’t have been used.  As my buddy Mike D. explains, in a world where inert gas and other forms of fire suppression are very expensive, there are many data centers that actually opt for sprinklers (with an important caveat, which I will explain in a moment).

When hardware was expensive, we tried to save the hardware with non-water solutions.  As hardware became cheaper, the services provided by the hardware (and not the hardware itself) became the priority, which means bouncing control to the secondary site while aggressive fire suppression (including water) deals with the primary location. 
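To make that concrete, here is a minimal sketch (in Python, with hypothetical host names and thresholds) of what “bouncing control to the secondary site” can look like: a monitor polls the primary site’s health check and, after a few consecutive failures, directs traffic to a standby site in a different location.

import time
import urllib.request

# Hypothetical endpoints -- the primary and standby sites sit in
# physically separate data centers.
PRIMARY = "https://primary.example-bank.ca/health"
STANDBY = "https://standby.example-bank.ca/health"

FAILURES_BEFORE_FAILOVER = 3   # tolerate brief blips before switching
CHECK_INTERVAL_SECONDS = 10


def is_healthy(url: str) -> bool:
    """Return True if the site answers its health check with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status == 200
    except OSError:
        return False


def monitor() -> None:
    """Poll the primary site and fail over to the standby when it dies."""
    consecutive_failures = 0
    while True:
        if is_healthy(PRIMARY):
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= FAILURES_BEFORE_FAILOVER:
                # In a real deployment this would repoint DNS or a load
                # balancer at the standby site, not just print a message.
                print("Primary unreachable -- directing traffic to", STANDBY)
                break
        time.sleep(CHECK_INTERVAL_SECONDS)


if __name__ == "__main__":
    monitor()

None of this is exotic; the whole point of the pattern is that the standby answering those requests is somewhere else.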

The following statement, released yesterday, explains why this incident makes me angry as an architect:

The system-wide outage was caused when a transformer exploded in an electrical room at Shaw Communications’ downtown headquarters Wednesday afternoon. Although the backup system was activated, when the sprinklers came on, they were also taken out.

This statement violates a basic truth in IT infrastructure.

It doesn’t matter if your building is fire proof, earthquake proof, tornado proof, nuclear bomb proof or whether it has its own nuclear reactor for unlimited power.  It doesn’t matter if error-prone humans are not allowed in the building, replaced by “perfect robots” (created by error-prone humans).

You never put your primary and backup servers in the same place.
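A rule like that can even be enforced automatically. The sketch below is purely illustrative and assumes a hypothetical inventory that records which physical site each server sits in; a deployment check then refuses any configuration where a primary and its failover share a location.

# Hypothetical server inventory: name -> physical site. In a real shop
# this would come from a CMDB or infrastructure-as-code definitions.
INVENTORY = {
    "core-banking-primary": "calgary-shaw-court",
    "core-banking-failover": "calgary-shaw-court",   # the mistake in question
}


def validate_separation(primary: str, failover: str) -> None:
    """Fail loudly if a primary and its failover share a physical site."""
    primary_site = INVENTORY[primary]
    failover_site = INVENTORY[failover]
    if primary_site == failover_site:
        raise ValueError(
            f"{primary} and {failover} are both in {primary_site}; "
            "a single fire, flood or sprinkler takes out both."
        )


try:
    validate_separation("core-banking-primary", "core-banking-failover")
except ValueError as err:
    print("Deployment blocked:", err)

A check like this costs almost nothing to write, and it turns the best practice into something a deployment pipeline can enforce rather than something people have to remember.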

There’s one thing that we know about IT and communications.

Murphy’s Law rules everywhere.

When you put your primary and secondary systems together, you are doing so while crossing your fingers, picking a 4-leaf clover, sacrificing a goat to the gods and saying a silent prayer that bad things won’t happen to you.

Most people who put both systems together do it because:

1. They are saving money

2. They don’t know any better

3. They are overly confident of their solution

4. They don’t care, exposing themselves to Hanlon’s Razor – “Never attribute to malice that which is adequately explained by stupidity.”

Money Rules the Day

I suspect it was reason #1 … well, I hope it was anyway because the other 3 reasons are REALLY problematic.

The reason this event makes me angry is that physical separation of primary and failover servers is basic, teach-the-kids-in-college stuff.

And so when I see some significant names taken out because economics seem to have ruled the day, I wonder what other architectural best practices have been compromised by economics, whether in privacy, security or elsewhere.

I worry because I have seen over the years that the factors listed above rarely confine themselves to just one area of an organization’s architectural best practices.  Once factors that limit effective solutions are present, they tend to pervade all aspects of an organization’s IT solutions.

If it was for reasons 2-4 (non-financial reasons), the players involved need to be considered for re-education, reprimand or “retirement”, including but not limited to:

1. The architect(s) who designed the solution.

2. The data center facility manager(s) who approved it.

3. The customer service exec(s) who sold it to other orgs (unless they don’t understand it, in which case they shouldn’t be selling it anyway).

Regardless of the reason, the following need to be considered for the same “special treatment”:

1. The leadership team of the creator of the solution.

2. The buyers representing ATB, Alberta Health Services or other groups who evaluated and recommended use of the solution.

3. The leadership team of the buyers who signed off on the solution.

If it was for reason #1 (which, in a twisted sort of way, offers the most comfort), the bean counters now need to reflect on the result of their cost-saving venture as they sort out consumer impact and a multi-tier service level agreement involving IBM, Shaw Communications and the many users of the facilities, including ATB, Service Alberta, Alberta Health Services (which cancelled surgeries as a result of the fire) and other groups.

Failures like this matter to all of us since that which we tolerate today becomes the norm tomorrow. And we know what history teaches us:

Those who don’t study history are doomed to repeat it, while those who study history are doomed to watch those who don’t repeat it.

Or maybe, given that similar failures have occurred in the past, such as with Aliant 6 years ago, the truth is that:

History teaches us that history teaches us nothing.

The Bottom Line

For me, no matter what the reason for the failure, doubt has been planted in my mind.  Doubt that makes me wonder where else compromises have been made.

And will such compromises produce a 2-day inconvenience the next time or will it be more dramatic or problematic?

Only the architects of the affected organizations really know.

I wonder how many 4-leaf clovers they have in their back pocket.

In service and servanthood,

Harry

 

PS   In reflecting on my experience over the years with data centers, I remembered an interesting incident early in my career.  During a tour of a data center containing classified government information, I was asking questions about the halon fire suppression system.  The system was designed to seal the data center, with no means of reopening the doors or exiting from the inside until the fire was under control. 

As a young, naive IT guy at the time, I remarked that while I saw 20 or 30 people working in the data center, I saw only a small handful of breathing apparatus for these people to use should escape be required.

With that, the person leading the tour escorted me to his office and pulled out their operations guide.  In it, in clear language that could not be misinterpreted, one policy jumped out at me.

In case of fire, the first priority was to save the facility.

To be able to save the people inside was secondary in importance.

In essence, they were expendable.

Of course, everyone assumed that a fire would never occur in that data center and so such a policy wasn’t questioned. 

But as in the case of the Shaw Court fire, you know what happens when one assumes things.

I would like to think that in today’s world, such a policy within a data center like that couldn’t exist.

But then again, who knows?

 

Addendum: July 14, 2012

Three days after the fire, the impact on Alberta Health Services and other organizations continues to be felt. Public accountability and transparency are essential to understanding what happened and how such situations can be prevented moving forward.

2 comments:

  1. ATB is still experiencing problems, despite online banking being up and running

  2. Restoring a system in the manner that they are doing it is very complex, with challenges that can create bigger problems when not done correctly.

    That is why better planning and architecture can help avoid having to deal with the situation down the road.

    We always pay for the choices we make. Where we decide to pay for them (proactive or reactive) determines the actual cost (beyond financial) that we ultimately incur!
