
Friday, July 13, 2012

The Shaw Fire And Why It Matters

At 7:12 am MDT on July 13, 2012, Alberta Treasury Branch tweeted that its online systems were officially back up, approximately 40 hours after the fire in Shaw Court took out its primary servers in the data center there and water from a sprinkler system took out the backup servers sitting in the same location.

I’ve been on a bit of a rant lately about the thoroughness of IT architecture and this unfortunate incident makes me angry.

I know there’s a lot of debate going on around why sprinklers were in the data center and why a non-water-based fire suppression system wasn’t used.  As my buddy Mike D. explains, in a world where inert gas and other forms of fire suppression are very expensive, many data centers actually opt for sprinklers (with an important caveat, which I will explain in a moment).

When hardware was expensive, we tried to save the hardware with non-water solutions.  As hardware became cheaper, the services provided by the hardware (and not the hardware itself) became the priority, which means bouncing control to the secondary site while aggressive fire suppression (including water) deals with the primary location. 

The following statement, released yesterday, explains why this incident makes me angry as an architect:

The system-wide outage was caused when a transformer exploded in an electrical room at Shaw Communications’ downtown headquarters Wednesday afternoon. Although the backup system was activated, when the sprinklers came on, they were also taken out.

This statement violates a basic truth in IT infrastructure.

It doesn’t matter if your building is fire proof, earthquake proof, tornado proof, nuclear bomb proof or whether it has its own nuclear reactor for unlimited power.  It doesn’t matter if error-prone humans are not allowed in the building, replaced by “perfect robots” (created by error-prone humans).

You never put your primary and backup servers in the same place.

There’s one thing that we know about IT and communications.

Murphy’s Law rules everywhere.

When you put your primary and secondary systems together, you are doing so while crossing your fingers, picking a 4-leaf clover, sacrificing a goat to the gods and saying a silent prayer that bad things won’t happen to you.

People who put both systems together usually do it because:

1. They are saving money

2. They don’t know any better

3. They are overly confident of their solution

4. They don’t care, exposing themselves to Hanlon’s Razor – “Never attribute to malice that which is adequately explained by stupidity.”

Money Rules the Day

I suspect it was reason #1 … well, I hope it was anyway because the other 3 reasons are REALLY problematic.

The reason this event makes me angry is that physical separation of primary and failover servers is basic, teach-the-kids-in-college stuff.
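The college-level arithmetic behind that rule can be sketched in a few lines. This is a rough illustration, not a risk model: the function names and the probabilities are my own assumptions (not figures from the Shaw Court incident), and it assumes that site-wide events and individual server failures are independent of each other.

```python
def outage_probability_colocated(p_site: float, p_server: float) -> float:
    """Primary and backup share one site.

    A site-wide event (fire, flood, power loss) takes out both at once;
    otherwise both servers must fail on their own for an outage.
    """
    return p_site + (1 - p_site) * p_server ** 2


def outage_probability_separated(p_site: float, p_server: float) -> float:
    """Primary and backup sit in different sites.

    Each system is down if its own site is lost or its own server fails,
    and an outage requires both systems to be down at the same time.
    """
    p_one_down = p_site + (1 - p_site) * p_server
    return p_one_down ** 2


# Hypothetical numbers, chosen only to make the comparison visible.
P_SITE = 0.01    # assumed chance of losing an entire site in some period
P_SERVER = 0.05  # assumed chance of a single system failing on its own

colocated = outage_probability_colocated(P_SITE, P_SERVER)
separated = outage_probability_separated(P_SITE, P_SERVER)

print(f"co-located: {colocated:.6f}")
print(f"separated:  {separated:.6f}")
```

With these made-up inputs, the co-located design can never do better than the chance of losing the site itself, no matter how reliable the individual servers are. That floor is the whole argument for separation.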

And so when I see some significant names taken out because economics seem to have ruled the day, I wonder what other architectural best practices have been compromised by economics – best practices in the areas of privacy, security or other areas.

I worry because I have seen over the years that the factors listed above tend to not settle in just one area of an organization’s architectural best practices.  Once factors that limit effective solutions are present, they tend to be pervasive through all aspects of an organization’s IT solutions.

If it was for reasons 2-4 (non-financial reasons), the players involved need to be considered for re-education, reprimand or “retirement”, including but not limited to:

1. The architect(s) who designed the solution.

2. The data center facility manager(s) who approved it.

3. The customer service exec(s) who sold it to other orgs (unless they don’t understand it, in which case they shouldn’t be selling it anyway).

Regardless of the reason, the following need to be considered for the same “special treatment”:

1. The leadership team of the creator of the solution.

2. The buyers representing ATB, Alberta Health Services or other groups who evaluated and recommended use of the solution.

3. The leadership team of the buyers who signed off on the solution.

If it was for reason #1 (which, in a twisted sort of way, offers the most comfort), the bean counters now need to reflect on the result of their cost saving venture as they sort out consumer impact and a multi-tier service level agreement involving IBM, Shaw Communications and the many users of the facilities, including ATB, Service Alberta, Alberta Health Services (which cancelled surgeries as a result of the fire) and other groups.

Failures like this matter to all of us since that which we tolerate today becomes the norm tomorrow. And we know what history teaches us:

Those who don’t study history are doomed to repeat it while those who study history are doomed to watch those who don’t repeat it.

Or, given that similar failures have occurred in the past, such as with Aliant 6 years ago, maybe the truth is that:

History teaches us that history teaches us nothing.

The Bottom Line

For me, no matter what the reason for the failure, doubt has been planted in my mind.  Doubt that makes me wonder where else compromises have been made.

And will such compromises produce a 2-day inconvenience the next time or will it be more dramatic or problematic?

Only the architects of the affected organizations really know.

I wonder how many 4-leaf clovers they have in their back pocket.

In service and servanthood,

Harry

 

PS   In reflecting on my experience over the years with data centers, I remembered an interesting incident early in my career.  During a tour of a data center containing classified government information, I was asking questions about the halon fire suppression system.  The system was designed to seal the data center, with no means of reopening the doors or exiting from the inside until the fire was under control. 

As a young, naive IT guy at the time, I remarked that while I saw 20 or 30 people working in the data center, I only saw a small handful of breathing apparatus to be used by these people should escape be required.

With that, my guide escorted me to his office and pulled out their operations guide.  In it, in clear language that could not be misinterpreted, one policy jumped out at me.

In case of fire, the first priority was to save the facility.

To be able to save the people inside was secondary in importance.

In essence, they were expendable.

Of course, everyone assumed that a fire would never occur in that data center and so such a policy wasn’t questioned. 

But as in the case of the Shaw Court fire, you know what happens when one assumes things.

I would like to think that in today’s world, such a policy within a data center like that couldn’t exist.

But then again, who knows?

 

Addendum: July 14, 2012

Three days after the fire, the impact on Alberta Health Services and other organizations continues to be felt. Public accountability and transparency are essential to understanding what happened and how such situations can be prevented moving forward.

Thursday, July 12, 2012

So How Secure Are We Anyway?

I was in the process of completing my annual report on security vulnerabilities yesterday when the news reported that an explosion in a communication hub in downtown Calgary had compromised landline and 911 service for 30,000 Shaw customers, including some municipal and provincial services.

As I write this this morning, service is almost completely restored.

No biggy …. they only lost service for 12 hours or so, right?

Well, maybe …. but where was the redundancy that should have prevented the failure from impacting those affected?

Here was the cause for the failure:

The system-wide outage was caused when a transformer exploded in an electrical room at Shaw Communications’ downtown headquarters Wednesday afternoon. Although the backup system was activated, when the sprinklers came on, they were also taken out.

I guess they didn’t think of, or couldn’t afford, a non-water-based fire suppression system, which is typical for rooms containing mission-critical computer or communication equipment.  Nor did anyone consider the impact of a total site loss, given that they kept the backup system in the same building as the primary system.

Then I think about the time I was in Newfoundland when a fire in a communication hub took out land lines, cell phones, Internet and all forms of communication (thus knocking out any use of debit / credit cards).  The outage was only hours in duration but while the event was in progress, spokespersons for Aliant (the communication company that owned the building) were saying they had no idea when the outage would be corrected, creating extra concern at the time.

Was there redundancy of technology in this situation to protect consumers against a catastrophic failure?

Yes, according to Aliant.  They had full redundancy of all systems.  Unfortunately, the primary and backup systems were in the same building and shared a common power supply.

Where did the failure occur?

You guessed it – the power supply.

So much for redundancy in either of these events.
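Both events come down to the same check that nobody ran: list what each system depends on and see what the primary and backup have in common. Here is a minimal sketch of that check. The `System` class, the `shared_dependencies` function and the dependency labels are all hypothetical names I made up for illustration; they do not describe the actual Shaw or Aliant configurations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class System:
    """A deployed system and the things it cannot run without."""
    name: str
    dependencies: frozenset


def shared_dependencies(primary: System, backup: System) -> list:
    """Return every dependency the two systems have in common.

    Anything in this list is a single point of failure: losing it
    takes out the primary and the backup together.
    """
    return sorted(primary.dependencies & backup.dependencies)


# Hypothetical layout: "redundant" systems in one building on one power feed.
primary = System("primary", frozenset({"building:hub-1", "power:feed-A", "rack:12"}))
backup = System("backup", frozenset({"building:hub-1", "power:feed-A", "rack:47"}))

common = shared_dependencies(primary, backup)
if common:
    print("Not truly redundant, shared dependencies:", ", ".join(common))
else:
    print("No shared dependencies found in this inventory.")
```

An empty result only means the inventory shows no overlap. The check is only as good as the dependency list fed into it, so a shared transformer or sprinkler zone that nobody wrote down will still slip through.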

Ironically, the Aliant redundancy mistake, which occurred six years ago, was studied by information and communication providers across Canada to make sure no one repeated the same mistakes in the future.

Ooops.

When the World Trade Center came down, some of the major communication providers had been using it as a communication hub.  After all, they figured, what are the odds that we could lose the entire site?

Sadly, we know the answer and communication in the NYC area was compromised as a result of the WTC collapse and an excessive number of people using the system in the hours of terror that followed.

When we build communications systems such as these, we strive to strike a balance between need and cost, factoring in the probability of various external factors and events.  We don’t build systems that can handle everyone and everything because, as we like to think, what is the likelihood of a worst case scenario occurring?

As we proved in NYC on 9/11, the likelihood is low, but when the worst case does occur, having systems that can handle emergencies is critical.

But alas, I digress ….. on to my originally intended subject.

My Security Report

As part of what I do as a strategy advisor and global technology architect, I provide services to some clients in the areas of assessing security vulnerabilities.

Specifically, how secure are various clients’ IT infrastructures, what can be done to enhance their security and, should a compromise occur, how quickly can it be neutralized while minimizing its impact?

The contents of my report, which will be distributed to specific organizations, show a number of interesting slices of society that are vulnerable to attack.

The list includes, but is not limited to:

- Specific large-scale banks and credit card providers

- Specific health-care providers

- Specific municipal, state and provincial governments

- Specific airlines

- Specific energy generation / distribution groups

- Specific infrastructure organizations, including some that govern water distribution and public transit

- A specific Roman Catholic Archdiocese that has been rocked by pedophile priest prosecution in the past and is alleged to be hiding a list of known pedophile priests (unknown to the public) who are still active priests

- Other large corporations in manufacturing and retail

- Other entities whose “commercial” nature I am not allowed to mention here.

The vulnerabilities range in nature and scale but the bottom line is this.

There is still way too much vulnerability in our infrastructure, whether it be in our communication infrastructure, in the security and privacy of our data or in national security overall.

Why Is This Happening?

Some folks do the best they can with the limited funding they are given by their leadership - leadership that downplays the risks of not having a thorough solution or doesn’t understand the impact to their organization, public or private, and the people they serve should a compromise occur.

Some organizations, governed by greed, pour their efforts into maximizing return, assuming that creating secure, redundant architecture is just a money-wasting venture that impacts their bottom line unnecessarily.

Some organizations create solutions so complex that obvious vulnerabilities slip by them and they watch in dismay as the seemingly ultimate in technology falls to simple attempts to compromise it.

Some organizations have a lack of knowledge about the threats they face and what is needed to neutralize the threat.  I saw with amusement (and concern) last year when a national retailer placed a classified ad looking for someone to take charge of the design and implementation of a security solution for their entire corporation.

Why was I concerned?  The minimum requirement for the position was a high school diploma.  No other experience, education or security solution background was required.  I guess they will learn on the job.

All of this being said, I still believe that ego and an excessive amount of hubris are responsible for most of the problems we face today.

Beliefs such as “nobody can defeat my security solution” or “the likelihood that compromise or disaster will hit us is minimal” are responsible for many of our compromises, both the ones that make the press and the ones that people on “the outside” never hear about.

How much of a problem is this?

A significant one.

While billions of dollars go into airline security and border control annually, I believe we face a much larger threat when it comes to the security and redundancy of our infrastructure than we do from someone taking a plane out of the sky or sneaking something across the border.

Much of the knowledge of how to compromise, penetrate, steal from or cause the failure of communications and IT infrastructure is available in the public domain.  We face multiple threats, ranging from the seemingly benign example of kids trying to hack into the local high school to get the answers to an upcoming exam to agencies (including foreign governments) attempting, sometimes successfully, to penetrate our critical corporate, government and military computer systems.

The head of the National Security Agency recently said we need to pour more resources into beefing up our cyber security, causing many people to cry foul that Big Brother was using this as a guise to exert even more control over us.

While I am wary of how much insight government has into our private matters, this is one area where we must not underestimate the need to invest more into protecting our technology assets.

I once asked a well known US / UK military advisor-turned-journalist how he dealt with his knowledge of our vulnerabilities and this was his reply:

“I try not to stay sober”

Now that’s a sobering thought.

The people in my industry (information and communication technology) need to do a much better job at enhancing the security of the citizens of the world at the personal, corporate, national and global levels.

The people who provide funding and make the go / no-go decisions that enable / restrict the people in my industry need to be better informed about the importance and impact of their decisions in supporting such ventures.

And each of us, while varying in levels of technical savvy, must do our best to hold all of these organizations responsible and accountable to do the best they can.

And we’re a long way from doing the best we can.

Many organizations, private and public, have knowingly or unknowingly created ticking time bombs that will impact all of us.

Acknowledging this is not “sky is falling” pessimism.

Acknowledging it is the only way it gets fixed before we get punished for not taking appropriate action.

This is not pessimism.

It is reality.

Most of us say we would do anything to protect the security of our families, our businesses, our nations and the world.

It is time to prove it … with a sense of urgency and appropriate action commensurate with the threats that exist.

In service and servanthood,

Harry

Addendum – July 12, 2012

This news story (about compromised Yahoo accounts) that broke an hour after I wrote the blog is a reminder of our personal responsibility to ensure the integrity of our personal information on the web.

And then a bank went down …..

I noticed that a Canadian bank, more than 24 hours after the previously noted fire in Calgary, still does not have an online presence as a result of this outage in one building.

Here is what Alberta Treasury Branch customers (both personal and corporate accounts) receive if they go to access their online accounts for bill paying and such (emphasis shown is theirs):

We're Sorry...

The fire at the Shaw Court building in Calgary yesterday caused our banking system to go down.

Overnight we moved our system to a back up location. We are working to resume normal services, and anticipate that it will take us a bit of time.

Meanwhile, ABM, debit cards and MasterCards are available, and our branch staff will also be able to assist customers.

We are currently working to restore ATB Online banking and ATB.com as soon as possible, please check back here or on our ATB Financial Twitter account (@atbfinancial) for updates.

We can not access emails right now, so please call your local branch directly if you have questions or require assistance. Our Customer Care Centre associates (1-800-332-8383) are also available to provide more information. And remember, we never contact you via text or email to ask for your personal or banking information.

© 2011 ATB Financial | All Rights Reserved. TM Trademark of Alberta Treasury Branches. Unauthorized access is prohibited. Usage may be monitored. Please visit our website at www.atb.com

So an electrical fire in one building has derailed the online processing for an entire bank for an entire province.

Neither comforting nor an acceptable architecture, in my opinion.

Addendum – 5:40 PM MDT

I received the following note which I couldn’t help but share :-)

Dear Mr. Tucker,

My name is Dxxxxx and I live in xxxxxxx, Alberta.  I am a customer of ATB and because I am on the road, I need to pay some bills today using their online system and of course I cannot. When RIM went down last October, I had to deal with a lot of angry customers and almost lost one because of my inability to respond quickly to them.  With all the firefighting I had to do with my customers because of RIM, I got some free games from RIM for my trouble.

Since I need to explain to some people why I can’t pay my bills today, do you think that ATB will offer me some free games also?

I guess on days like this you need a sense of humour.

Cheers,

Dxxxxx

Dear Dxxxxx,

I hear that the new Angry Birds is pretty cool and might be appropriate. :-)

Thanks for the note!

Harry

Addendum: July 17, 2012

This little ditty was announced on July 17, 2012.  Info about up to 2.4 million voters may be compromised: Elections Ontario.  Preventable and sadly …. predictable.  We can do better and must do better.

Addendum: August 7, 2012

Here’s how easily one can be compromised.  If we are in the IT industry, we need to demand better of ourselves.  If we are not in IT, we need to demand better from those who are.