Server Specs - A SearchDataCenter.com blog



The blog for all things data center, including design and infrastructure, Unix, Linux, mainframes and x86 servers, power and cooling efficiency, information technology (IT) service management, server consolidation and virtualization, and more.

Communication: A neglected key to surviving a data center disaster

A friend of mine once said “So long as you are not on the train, everyone likes a good train wreck.”

Since I am responsible for maintaining data centers, I like to read as much as I can about the occasional “train wreck” in my industry. I find it beneficial to investigate these events with the goal of learning as much from them as possible. As I’ve written before here on ServerSpecs, how else can we avoid pitfalls if we don’t observe the mistakes made by others?

Since that was written, a few high-profile outages have happened. And through coincidental relationships I was able to get a lot of information about a recent event involving one large Web hosting organization acquiring another, then relocating the servers to another facility. It devolved into the proverbial train wreck and multi-day outages ensued.

I have a very low-tech hobby, something that serves as a distraction from my high-tech daily life: maintaining and driving a classic car once owned by my father. As with any oddball pursuit, the Internet has become the glue that holds the far-flung devotees of this particular make and model of car together. I spend my evenings chatting with people all over the globe on a mailing list devoted to this car. Through this group I met a person whose business website was hosted by the acquired hosting organization in the outage described above. After his website finally came back online, I asked about his experience as an affected party.

His reply underlined an important lesson for data center operators. All too often we focus on the problem at hand, be it planning a migration or bringing an outage to an end. Just as often we forget two important factors: why we are doing it, and who we are doing it for. The “whys” are usually obvious to the point of being forgotten. However, it is vitally important to identify the reasons why we do what we do. Even more important is that we communicate those reasons to the people our actions affect.

We look at “our” data centers and see infrastructure: power, cooling and network systems to be maintained and managed. Servers are often seen as the object of all that care, but in reality those servers serve human beings. Those human beings could be defined as your users, your customers, or in many cases both. So really “our” data centers are not “ours” at all; they exist to serve our customers. We are not here just to manage a facility; we are here to manage a facility on behalf of our customers. They rely on us to do our jobs and keep the data center running.

Ask any user/customer about an incident like an outage or data center disaster they’ve lived through and one thing starts to become very clear: Awareness is more important than uptime. To those of us who live with the view that our paychecks depend upon uptime numbers, this is a startling concept.

If you ponder it for a while, however, it makes sense. Human beings are tremendously adaptable, provided we are aware of our circumstances. We can handle a disruption of our routine if we are able to plan for it, understand its duration, and anticipate its effect. But if the disruption is handled in a fashion that they view as arbitrary, capricious, or inconsiderate, those same humans become angry, bitter, and frustrated.

In my conversation with my acquaintance who survived the multi-day data center outage, he summarized his experience with thoughts like:

“Things they did wrong? I can only guess at half of it. Very little communication with customers. The move was postponed once, postponed twice, and abandoned, without explanation. By Monday, there was essentially no communication at all. Through the critical first 48 hours, the news was glowing. Through the next desperate 24, they were incommunicado. Then came the lies…the network is down because of a DoS attack, they always planned to move the servers, most sites are up, all sites are up, etc.”

Note the overall expression of frustration in those words, and the central theme of feeling ignored and abandoned. He had specific technical complaints about procedure, but note the same theme recurring in this statement:

“In an apparent attempt to limit network traffic, they brought up servers, but seem to have blocked certain protocols. Having a server up and running for e-commerce, yet being unable to control it with ftp, ssh, telnet, etc could have been a disaster within a disaster for some. E-mail service was among the last things that came back, so you couldn’t even communicate with your own clients.”

Because they did not communicate with him, he felt abandoned. By cutting off his means of explaining the situation to HIS customers, they compounded the problem exponentially. While this is a single cited case, the same theme crops up again and again when you talk to the end users affected by data center outages.

So what lessons are to be learned? How can data center operators deflect the ire of their customers while managing a disaster?

The simple answer, of course, is “Communicate with the customer.” Actually accomplishing that in the midst of a crisis, however, is not as simple as it would seem. If you find yourself on that proverbial train in the process of going off the rails, how do you communicate on top of all the other, seemingly more important tasks? On top of that, technical people have often been stereotyped as poor communicators.

“All the social graces of a highly trained engineer” is an achingly funny phrase with plenty of truth buried within it. In part two of this article I’ll delve into that stereotype and elaborate on simple communication strategies for data center management, or indeed any sort of IT management role.

2 Comments »

  1. “How can data center operators…” - the operators (technicians, engineers, etc.) need MANAGERS who know how to “deflect the ire” and who know how to plan for disasters, including disastrous moves. The planning for a datacenter move, for instance, must include planning for how and when to communicate with customers before, during and after the move and especially if things go wrong.

    Planning must include working through various scenarios and having alternatives. Plans for keeping customers informed are just as important as the plans for the electrical connectors, the new network infrastructure and the cooling system.

    If communicating with your customers is normally done by email, and your email system is part of or relies on the equipment being moved, you’d better have a backup plan that will work even if that equipment gets dropped off the truck. Don’t laugh, it happens. Do you have a list of customer POC names and telephone numbers? Have you called those numbers lately?

    Do you expect the same person who’s physically responsible for moving/wiring/programming/etc. to also have the time and composure to contact customers when something goes wrong?

    Comment by SonicBee777 — February 7, 2008 @ 2:19 pm

  2. Good points, and thanks for commenting. A lot of this is explored in part two. I am a bit ambivalent about limiting the communication responsibilities to ONLY managers. Over the years of “geek wrangling” one thing I’ve learned is that solid technical people, no matter what their title may be, should have the ability to communicate to users and customers. I do agree with you though that CRITICAL communications in the midst of serious issues should come from a position of leadership and authority.

    –chuck

    Comment by cgoolsbee — February 8, 2008 @ 5:22 pm
