Assessing the risks of human error in the data center
Choice quote from a report on the converging trends of data center builds and high-density computing:
“Respondents report that the majority of data center outages were caused by human error and improper failover.”
Risk assessment is part of the job of every data center operator. In reality, it falls within the purview of every IT professional. Why? Because the more we depend upon technology, the more we miss it when it fails. Most entities today cannot function without their IT infrastructure. Information technology at the desktop level is still quite immature and suffers from a myriad of failures, both great and small… most no doubt having to do with the dreaded PEBCAK errors.
In a data center environment, however, the user is physically removed from the equipment and cannot cause errors, right? Well, not really. The data still point to human error and, by extension, failover issues in redundant systems. I’d be willing to bet that the failover problems were built in from the start by human error in either the design or implementation phases.
Human error is not a very sexy topic of discussion and rarely gets mentioned in the data center realm. Data centers are frequently painted as completely automated, “lights out” facilities where machines hum along in chilled rooms and those pesky error-prone humans are kept out by ever-present security systems such as man-traps, biometric scanners, and in some bizarre cases, armed guards.
This could not be further from the truth. It is an illusion conjured up by data center marketers. While you won’t find the people-per-square-meter ratios of a typical office building in a data center, human beings are a vital part of the picture, and facilities must take into account their presence and their potential for error.
In my adventurous youth, I spent my free time pursuing thrills in high places. I was an Alpinist. Alpinism has a strong literary streak and part of every climber’s possessions is a collection of books: guidebooks, manuals, epic tales of adventures in remote ranges, and yearly journals of climbing organizations.
One such publication is the American Alpine Club’s Accidents in North American Mountaineering. It is a chronicle of human error, and an accounting of its cost. It is required reading for any alpinist as it is a list of things NOT to do while in the mountains. Yes, there are objective hazards in the mountains such as weather, avalanches, etcetera, but reading accident reports you quickly realize that the greatest risk to any climber’s life is their own misjudgment.
We are taught that human beings learn from their mistakes. Smart human beings learn from other people’s mistakes. That is the purpose of Accidents in North American Mountaineering: an accounting of human error, some of it fatal, for smart human beings to learn lessons from. These sorts of objective post-event analyses are not limited to mountaineering, either.
The aviation industry has an excellent record of publishing data about accidents, both major and minor. The military of course has an entire bureaucracy dedicated to gathering and analyzing data, both from history and recent combat. They even have a TLA for the process: AAR, for “After Action Report.”
There is no such widely read chronicle of human error for the data center operator to learn from. That is a shame. I know of significant outages that have happened at facilities, but have no idea of their true cause(s), or what could have been done to prevent them. You hear of them through the rumor mill, you can track them sometimes via blogs, but rarely are the exact causes and sequences of error fully reported. I understand why.
Data centers are supposed to be foolproof facilities dedicated to uptime. In reality, data center operators have to contend with the laws of physics, specifically the Second Law of Thermodynamics, and with the propensity of human beings to make errors. In other words, stuff breaks and nothing is foolproof. However, at least in the colocation business, the sales side of the industry quashes honesty about outages for fear of lost revenue.
I’m looking at sending my facilities staff to AFCOM’s Datacenterworld Conference, but in looking over the schedule I see only one session concerning actual outage post-mortem analysis. Lots of marketing-speak about design and maintenance, but no obvious mistakes to learn from. I realize that, as commercial enterprises, many data centers would rather not expose their flaws. Remember, though, that the whole point of this exercise is not to assign blame or to subject the unfortunate to ridicule. The object is to learn from the mistakes of others for the betterment of the ENTIRE industry.
Posted: June 14th, 2007 under Data center physical infrastructure, Data center disaster recovery planning, Data center standards and metrics.