Server Specs - A SearchDataCenter.com blog


The blog for all things data center, including design and infrastructure, Unix, Linux, mainframes and x86 servers, power and cooling efficiency, information technology (IT) service management, server consolidation and virtualization, and more.

Data Center Communications 101 (aka, How to survive a disaster, part 2)

IT folks were using technologies like message boards, SMS messages, email, chat protocols, web logs, RSS, and micro-blogging long before they went mainstream. How can the data center manager put these communications channels to use as disaster recovery tools? The simple answer is “use them.” However, using them successfully is not that simple. This is part two of the column Communication: A neglected key to surviving a data center disaster.

“How can you tell an engineer is an extrovert?”

“They look at your shoes instead of their own when they talk to you.”

Yes, it is a funny stereotype, and like all stereotypes it has a kernel of truth. Engineers, technical people, geeks, whatever term you want to apply to us (I say “us” because I certainly fall on this side of the dividing line) share that particular trait to a certain degree. We tend to be thinkers and analyzers, drawn to interacting with machines. When we have something to say it is frequently seen as blunt. However, that does not mean that technical people are not capable of being great communicators.

Quite the opposite in fact: We tend not to speak until we are certain of what we say. Additionally, we embrace communications technologies and put them to use long before they become mainstream. Think about the history and adoption curve of these recent technologies: message boards, SMS messages, email, chat protocols, web logs, RSS, and micro-blogging. What sort of people used these technologies before they were widely adopted? That’s right… us.

How can the data center manager put these communications channels to use as disaster recovery tools? The simple answer is “use them.” However, using them successfully is not that simple.

First you have to choose the appropriate tool for this external communications channel. In some instances email may be the best, in others a web log with an RSS feed may be the better choice. It may not even involve a “high tech” solution, as it can also be something as simple as a welcome/status message on your phone system, or even a “scoreboard” hanging above your cubicle. The key is to pick something that works within the context of your organization, and most importantly, will be effective in touching the greatest number of your users or customers with the least amount of effort.

All the same considerations that apply when designing a critical system have to be taken into account when you make this decision. The channel must be able to function in a highly reliable fashion, even during an outage. It must also have a secondary and even tertiary backup system in order to continue to function in the case of an emergency that disables your primary system.
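The primary/secondary/tertiary idea can be sketched in a few lines of Python. This is only an illustration, not a description of any real product: the `post_status` function, the channel names, and the two stand-in senders are all hypothetical, and each "channel" is assumed to be nothing more than a function that raises an exception when it cannot deliver.

```python
from typing import Callable, List, Tuple

# Hypothetical channel type: a display name plus a send function
# that raises an exception if the message cannot be delivered.
Channel = Tuple[str, Callable[[str], None]]


def post_status(message: str, channels: List[Channel]) -> str:
    """Try each channel in priority order; return the name of the first that works."""
    errors = []
    for name, send in channels:
        try:
            send(message)
            return name
        except Exception as exc:  # any failure means "fall through to the next channel"
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all status channels failed: " + "; ".join(errors))


# Stand-in senders for the example (assumptions, not real integrations).
def unreachable_blog(message: str) -> None:
    raise ConnectionError("web server is down")


phone_log: List[str] = []


def phone_greeting(message: str) -> None:
    phone_log.append(message)


channels = [("status blog", unreachable_blog), ("phone greeting", phone_greeting)]
used = post_status("Scheduled UPS maintenance tonight, no impact expected.", channels)
print(used)
```

The point of the ordering is that the fallback should not share a failure mode with the primary: if the status blog lives in the same facility that just lost power, the second entry in the list needs to live somewhere else.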

Next, and this is the critical step, you have to use it. Do not use it only as a harbinger of doom; also make it the teller of your tales, the bulletin of the boring, the messenger of the mundane. Use it constantly to inform of what is happening within your facility, your network, your systems, even your staff. Announce every scheduled maintenance interval, every successful circuit installation, the delivery of new equipment, everything.

This accomplishes two things. First, it gets you and your staff in the habit of updating your external communications channel. Something happening? It gets communicated. It is critical that this becomes an ingrained habit. Something is happening? Relay the status as well as investigate. Making progress? Update the status, and keep working on it. If a staff member is not directly involved in the activity, then they can certainly be the updater of information via the external communications channel.

Secondly, and most importantly, this trains your users and customers to look to the channel for all data concerning your data center status. If you have set an expectation among all your users and customers about how they learn what is going on within your realm, then when the going gets tough, whether you are dealing with an outage, executing a major facility migration, or handling whatever else circumstance throws your way, you can keep them informed. Informed users are satisfied users. Remember that paradigm-shifting conclusion from part one: Awareness is more important than uptime.

Your users and customers will tolerate incidents of downtime, but ONLY IF they are aware of them, and kept informed as to what is happening, and what is being done to restore things to an up state. By keeping them in a state of constant communication, even of mundane day-to-day operational matters, you build their trust while you keep them aware. The payoff comes when you have that big project or that unplanned outage and rather than being assaulted from multiple directions via multiple channels asking you for status, your users and customers refer to your now well-established external communications channel.

In the event of a large project — a data center migration, for example — you can use your communication channel to provide a far-ahead warning of what is going to happen. You can provide details of what is going to be done and how it will be done. You can elaborate on schedules, fall-back plans, contingencies, and expected results. You can announce reminders in the days and hours leading up to the start of each segment. You can post after-action reports of the successful, and perhaps unsuccessful moves. You can announce the ultimate completion of the project.

In the case of an outage, you can post updates and ETAs. Afterwards, you can use your channel to inform your users and customers of the root cause and what has been done to prevent the issue from recurring.

The purpose of this is to keep your users and customers informed. By remaining informed they will build trust in you. That trust is built through awareness of not only the critical, but the mundane and day-to-day. When your users stay aware, they perceive issues within your realm in a more positive light, and things which would have been seen as full-blown train wrecks had they been unaware are now likely to be seen as mere speed bumps.

Communication: A neglected key to surviving a data center disaster

A friend of mine once said “So long as you are not on the train, everyone likes a good train wreck.”

Since I am responsible for maintaining data centers, I like to read as much as I can about the occasional “train wreck” in my industry. I find it is beneficial to investigate events with the goal of learning as much from them as possible. As I’ve written before here on ServerSpecs, how else can we avoid pitfalls if we don’t observe those made by others?

Since that was written, a few high-profile outages have happened. And through coincidental relationships I was able to get a lot of information about a recent event involving one large Web hosting organization acquiring another, then relocating the servers to another facility. It devolved into the proverbial train wreck and multi-day outages ensued.

I have a very low-tech hobby, something that serves as a distraction from my high-tech daily life, namely maintaining and driving a classic car once owned by my father. As with any oddball pursuit, the Internet has become the glue that holds the far-flung devotees of this particular make and model of car together. I spend my evenings chatting with people all over the globe on a mailing list devoted specifically to this particular car. Through this group I met a person whose business website was hosted by the acquired company in the above-referenced data center outage incident. After his website finally came back online I inquired about his experience as an affected party.

His reply underlined an important lesson for data center operators. All too often we focus on the problem at hand, be it planning a migration or bringing an outage to an end. Just as often we forget two important factors: Why we are doing it, and who we are doing it for. The “whys” are usually obvious to the point of being forgotten. However, it is vitally important to identify those reasons why we do what we do. Even more important is that we communicate the reasons to the people our actions affect.

We look at “our” data centers and see infrastructure: Power, cooling and network systems to be maintained and managed. Servers are often seen as the object of all that care, but in reality those servers serve human beings. Those human beings could be defined as your users, your customers, or in many cases both. So really “our” data centers are not “ours” at all, they exist to serve our customers. We are not here to just manage a facility, we are here to manage a facility on behalf of our customers. They rely on us to do our jobs and keep the data center running.

Ask any user/customer about an incident like an outage or data center disaster they’ve lived through and one thing starts to become very clear: Awareness is more important than uptime. To those of us who live with the view that our paychecks depend upon uptime numbers, this is a startling concept.

If you ponder it for a while, however, it makes sense. Human beings are tremendously adaptable, provided we are aware of our circumstances. We can handle a disruption of our routine if we are able to plan for it, understand its duration, and anticipate its effect. But if the disruption is handled in a fashion that they view as arbitrary, capricious, or inconsiderate, those same humans become angry, bitter, and frustrated.

In my conversation with my acquaintance who survived the multi-day data center outage he summarized with thoughts like:

“Things they did wrong? I can only guess at half of it. Very little communication with customers. The move was postponed once, postponed twice, and abandoned, without explanation. By Monday, there was essentially no communication at all. Through the critical first 48 hours, the news was glowing. Through the next desperate 24, they were incommunicado. Then came the lies…the network is down because of a DoS attack, they always planned to move the servers, most sites are up, all sites are up, etc.”

Note the overall expression of frustration in those words, and the central theme of feeling ignored and abandoned. He had specific technical complaints about procedure, but note the most important theme recurring again in this statement:

“In an apparent attempt to limit network traffic, they brought up servers, but seem to have blocked certain protocols. Having a server up and running for e-commerce, yet being unable to control it with ftp, ssh, telnet, etc could have been a disaster within a disaster for some. E-mail service was among the last things that came back, so you couldn’t even communicate with your own clients.”

By not communicating with him he felt abandoned. By cutting off his means of explaining the situation to HIS customers the problem was compounded exponentially. While this is a single cited case, the same theme crops up again and again when you talk to the end-users affected by data center outages.

So what lessons are to be learned? How can data center operators deflect the ire of their customers while managing a disaster?

The simple answer of course is “Communicate with the customer.” Actually accomplishing that in the midst of crisis, however, is not as simple as it would seem. If you find yourself on that proverbial train in the process of going off the rails, how do you communicate on top of all the other, seemingly more important tasks? Technical people have often been stereotyped as poor communicators as well.

“All the social graces of a highly trained engineer” is an achingly funny phrase with plenty of truth buried within it. In part two of this article I’ll delve into that stereotype and elaborate on simple communication strategies for data center, or indeed any sort of IT Management role.

Ramps and staircases: Economic realities of the data center business

Boom, bust, drought, and recovery, and now boom again. Is bust next? If so, when?

This post on Rich Miller’s DataCenterKnowledge.com got me thinking about those cycles, and the whys behind them. Mind you, I am not a journalist, nor an analyst sitting on the outside looking in, I am right in the thick of things working “in the trenches” of the data center economy. I have seen every one of those steps in the cycle in the last decade, while building, filling, moving, decommissioning, and expanding data centers. Demand of course is what drives the need to build facilities and demand is high right now when compared to the supply of data center space. Reading the trade press you see announcements every day of this company or that one expanding existing data centers, or committing to build new ones.

It was only a matter of time before analysts and journalists began asking and speculating about the next phase in the cycle. When will there be too much capacity? When will the market demand drop? Will supply outstrip demand? How much is too much data center capacity?

Speaking from a position inside the industry, I think they are asking the wrong questions. Typical for analysts, especially the financial and investment sort, they are thinking of the right now, of next quarter’s results. If they would switch their macro lens for a wide-angle they might realize some simple truths about this business.

Econ 101

Demand
Conventional wisdom says that strong demand is what is driving growth, in the form of data center construction right now. It is also what drove the data center building boom of 1999-2001. However, if you look at demand from a long view, it has grown in a very linear fashion since the initial spark of the web-driven economy in 1993. The growth in the number of web servers has been steady, even through the “bust” days of 2001-2005. The Internet did not cease growing, rather, the rate of growth lessened slightly. There is no curve on the graph, just a long ramp at varying angles. Sharp inclines occur in 94-95, 99-01 and 06-present with shallower inclines at other times. The only graphs that have declines are those tracking pricing and players. Prices plummeted in the bust of 01-02 and are only now going up again. The same goes for players on the field. The bust of 01-02 left a lot of dead bodies. The companies died, but in many cases their carcasses, in the form of their datacenters, remain.

Supply
The data center growth graph may be a steady ramp, but the capital required to expand or build a datacenter is more of a staircase graph. Building data centers is an EXPENSIVE exercise. Significant amounts of money are involved, especially when commodities are in demand and prices are high. During the last boom fiber was one of the expensive limiters. Today the price of raw copper (for electrical infrastructure) and the scarcity of Diesel generators are the most acute pain points. The frequently quoted metric is $1000 per square foot to build a data center, however, building a facility to meet the current electrical and cooling demands costs much more. It is estimated (though not confirmed) that Google spends three times that figure on their data centers.

Where does this leave the data center operator? Spending only when the capital is available. If the analysts on Wall Street were smart, they would have been telling the investment community to put money into data centers between 2002 and 2006. Unfortunately they aren’t as smart as their salaries would seem to indicate. Trying to find capital to build or expand a data center at that time proved harder than finding WMDs in Iraq. The supply curve was on the flat of the staircase-shaped graph.

The demand ramp passed the supply staircase about eighteen months ago. Those mothballed “carcass of the bust” facilities were the first to be turned up as the demand incline grew sharper, mostly because the capital required to upgrade them, or in some cases just finish them, was suddenly available. Data centers built in 1999-2002 are just now showing profits, but they were also built to older specs, and are unable to manage the density that current deployments demand. This means they are approaching capacity, requiring more data center construction. That requires big bucks.

Now the big money is flowing, so data center operators are building what they can, while they can. The graph is on the vertical and looking to cross that steady demand ramp. Are they overbuilding? Of course they are. They are banking space now to tide them over after the capital flow stops, because it will stop. If history is a guide, the stop will come sometime in the next year or two. The valuation of Savvis is just the first hint that it is coming. We will have to be content with what was built, because the capital will stop flowing and we will coast along on that flat line until the demand ramp passes the supply and the investment begins again.
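The ramp-versus-staircase picture is easy to see with a toy model. The sketch below is only an illustration: the units are arbitrary, and the starting values, slope, step size, and five-year funding period are numbers I made up for the example, not industry data.

```python
def demand(year: int, start: int = 100, slope: int = 10) -> int:
    """Demand grows as a steady ramp (arbitrary units, assumed slope)."""
    return start + slope * year


def supply(year: int, start: int = 110, step: int = 50, period: int = 5) -> int:
    """Supply is a staircase: capacity only jumps when capital flows,
    assumed here to happen once every `period` years."""
    return start + step * (year // period)


states = []
for year in range(11):
    d, s = demand(year), supply(year)
    state = "surplus" if s >= d else "shortage"
    states.append(state)
    print(f"year {year:2d}  demand {d:3d}  supply {s:3d}  {state}")
```

With these made-up numbers, each capital injection buys a short surplus, and then the ramp climbs past the flat of the stair until the next round of funding. Change the step or the period and the boom and bust phases stretch or shrink, but the alternation remains as long as supply arrives in lumps.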

Market Maturity
Data centers being built today are going to exceed the current demand because they must continue to operate through the next cycle. It is a crazy system. A crazy system driven by simple factors. Unless we can find courageous investors who think further ahead than the next few quarters, we are stuck with it until the market matures. So is another bust coming? Yes, because the system dictates that one must. But in reality it is not really a bust as much as the point where the supply and demand curves swap positions. If you look at history and examine the infancy of any capital- and construction-intensive modern service industry, be it railroads or aircraft, or even telecommunications, you will see these curves swinging and intersecting in boom/bust cycles that modulate towards stability in the long term as the market matures. This is a very basic view, not taking into account other economic factors such as market and commodity prices; growth, failure and aggregation through M&A; technology shifts, and the like. Just basic Econ 101 supply and demand. Eventually the investment community will realize that it cannot dump or withhold capital based on its limited perception of the market. Investment will smooth the supply staircase into a ramp that hopes to stay just ahead of the demand ramp. The very definition of a stable market.

A Myth Busted: 1U servers do not provide greater density.

It sure would make life a lot easier if the data centers we manage existed entirely within two dimensions, which seems to be the world that server manufacturers think we live in. To them, the only two specifications of size that matter are height and width. They couldn’t care less about that forgotten dimension, DEPTH.

Every time a new server appears on the market the very first spec I check is depth. Width is a given, and heights are limited to a very narrow range (1U, 2U, etc), but the makers of gear destined for the data center seem to think they’ve been given a free pass to go as deep as they please. This drives me, and I can only assume my peers in the community, crazy! Nothing would please me more than to get a bunch of Dell, IBM, HP, Apple, etc server hardware engineers into a room … and then flood it with Halon. Oh, OK, maybe not Halon… I’d probably hit them with FM200 and then, when they come to, take them for a tour of a data center and show them the error of their ways.

My biggest beef here is the 1U servers that seem to be growing to absurd depths. The worst offenders I’m dealing with at the moment are Dell’s 1950 and Apple’s latest version of the Xserve. Both arrive at 30″ (76.2cm) or longer. I’m sure there are others that have reached these lengths too. They have roughly the same form factor as the flight deck of an aircraft carrier. 1U x ~18″ x REEEEAAAAALLLLLLLYYYYY LOOOOOOONG. Attach a catapult and you could be launching Maverick and Iceman in their Tomcats to intercept the inbound bogeys.

When 1U servers started appearing they were rather compact, akin in form factor to your average Ethernet switch. 1U x ~18″ x ~12″. Some, such as the old Cobalt “RAQ” web servers (remember those?) could be stacked on both sides of a 2-post rack for a total of 84 servers in under 6sq’ of floor space. When the larger players started shipping 1U boxes, they ranged in depth from 20″ to 24″ (51cm - 61cm) on average. This was the same, or a bit longer than the average 2U and 4U boxes that preceded them, but still manageable. They could fit in 2-post or 4-post racks. But about five years ago 1U servers started getting longer and longer.

How does this affect us? Density, of course. While having nothing but 1U servers would seem to be a step towards higher density, that is really only true if you live in a two-dimensional world. If your servers are now over twice as deep as they once were, entire rows of racks in the datacenter have to be moved farther apart to accommodate them. Logically, if your rows of racks are farther apart, the number of racks you can install in your datacenter shrinks.

Additionally, cabinets keep getting deeper to accommodate these longer and longer servers. It used to be that any server could be mounted in a 2-post rack. Cabinets were only needed if extra security was desired. Now the manufacturers of servers EXPECT you to mount them inside cabinets. No flexibility in mounting is offered… except maybe a choice of cage nuts or tapped holes.

I remember when a cabinet averaged 32″ (81cm) deep. Many of today’s servers won’t even fit in a 32″ deep cabinet. Well, they might fit, but you won’t be able to close the doors anymore! The cabinets we’re buying for our data center are now 42″ (107cm) deep. That adds almost two feet (61 cm) to every aisle in the facility. That means you can fit fewer aisles. By my top-of-my-head math that means you lose two full aisles for every 5000 sq’ of data center. Depending on the number of racks-per-aisle, that can add up to a LOT of servers you lose by having these outrageously long boxes.
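For anyone who wants to run this kind of top-of-the-head math against their own floor plan, here is a rough sketch. Everything in it other than the 32″ and 42″ cabinet depths is an assumption picked for illustration: a 100′ x 50′ room (5000 sq’), a 48″ cold aisle, a 36″ hot aisle, 24″ wide cabinets, and no allowance for end-of-row walkways, columns, or CRAC units. The function name `row_pairs` is mine.

```python
def row_pairs(room_length_in: int, cabinet_depth_in: int,
              cold_aisle_in: int = 48, hot_aisle_in: int = 36) -> int:
    """Number of back-to-back row pairs that fit along the room length.

    One repeating unit ("pitch") is: cold aisle, cabinet, hot aisle, cabinet.
    Aisle widths are assumed values, not a standard.
    """
    pitch = 2 * cabinet_depth_in + cold_aisle_in + hot_aisle_in
    return room_length_in // pitch


# Assumed 100' x 50' room, i.e. 5000 sq', with rows running across the 50' width.
ROOM_LENGTH_IN = 100 * 12
ROOM_WIDTH_IN = 50 * 12
CABINET_WIDTH_IN = 24  # assumed cabinet width

cabinets_per_row = ROOM_WIDTH_IN // CABINET_WIDTH_IN

results = {}
for depth in (32, 42):
    pairs = row_pairs(ROOM_LENGTH_IN, depth)
    results[depth] = (pairs, pairs * 2 * cabinets_per_row)
    print(f'{depth}" cabinets: {pairs} row pairs, {results[depth][1]} cabinets')
```

Under these particular assumptions the deeper cabinets cost one full pair of rows. The exact loss depends heavily on aisle widths and room shape, so treat the output as a way to check your own numbers rather than as a rule.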

There are many facilities, primarily older colocation datacenters that limit how much power-per-rack you can use, so frequently you see 42U cabinets with MAYBE 14U of space that is usable. Why even bother with 1U servers then? Your cooling is messed up by all the empty space. You might as well go back to the big 4U servers of yesteryear and pack ‘em in. But nobody makes those anymore. The only time you see servers larger than 1U is when they are serious power hogs, packed with drives and CPU. So we’re back to square one.
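How a power cap turns a 42U cabinet into a 14U cabinet can be sketched like this. The numbers are purely illustrative assumptions: a single 120 V / 20 A circuit per rack, loaded to 80% for continuous use, and a 1U server drawing 135 W. Real circuits and real server draw vary widely, so plug in your own figures.

```python
def usable_rack_units(circuit_volts: int, circuit_amps: int,
                      watts_per_server: int, rack_units: int = 42,
                      derate_pct: int = 80, server_height_u: int = 1) -> int:
    """Rack units you can actually fill, limited by power or by space.

    derate_pct is the assumed share of the circuit rating usable for
    continuous load (80% here); integer math keeps the result exact.
    """
    budget_watts = circuit_volts * circuit_amps * derate_pct // 100
    servers_by_power = budget_watts // watts_per_server
    servers_by_space = rack_units // server_height_u
    return min(servers_by_power, servers_by_space) * server_height_u


# Assumed power-capped colo rack: one 120 V / 20 A circuit, 135 W per 1U server.
print(usable_rack_units(120, 20, 135))

# Same cabinet with an assumed 208 V / 30 A feed and lighter 100 W servers:
# now the 42U of space is the limit, not the power.
print(usable_rack_units(208, 30, 100))
```

When the first number comes out far below 42, the height of the server stops mattering for density; the circuit is the constraint, which is the point of the paragraph above.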

Two recent events triggered this rant:

A customer sent a new Apple Xserve to us to replace their old Apple Xserve. The old one was a G5 unit, the new one a Dual-Core Xeon unit. Both we and the customer thought this would be an easy swap… power down, unplug, and pull out the old one, slide in the new one, plug in, and power up. Minimal downtime. Unfortunately the new one is two inches longer, the ports (network and power) have swapped sides, and the rack mounting hardware is completely different. What should have been a 5 minute operation turned into a multi-hour ordeal.

The next event, the one that sent me over the top, was when a new client had 32 Dell 1950s shipped to our facility, along with an APC NetShelter cabinet and power strips to plug it all in. Upon arriving for assembly we noted that the rackmounting rails provided by Dell stuck out from the back of the 32″ (81cm) deep 1950 by an additional 3″ (7.6cm). So now the total depth of the servers amounted to 35″ (89cm). There was no longer enough room at the back of the cabinet to mount the power strips. Our previously installed Dell 1950 servers did not have rails this long. What does this mean? Is Dell planning on making their next rackmount servers even LONGER? How long before we see 36″ long servers?

Do server designers ever try to actually rackmount their gear? Do they account for cables, power strips, etc? It seems to me that the unrestricted lengthening of the standard 1U server is becoming completely counter-productive to the original design goal of the 1U server, namely density of computing in the minimum amount of space. They’ve gained rackspace at the expense of usable FLOOR space. In the balance sheet of datacenter operations floor space is WAY more expensive than rackspace. I want my floor space back.

Does this frustrate you as much as it frustrates me?

Data center raised floor vs. solid debate

I just slogged my way through Douglas Alger’s 5-page excerpt from a Cisco Press White Paper purportedly discussing the merits of raised floor versus non-raised floor designs for data centers. It spends four paragraphs of the first page telling you why overhead distribution on a solid floor is not good, then rambles on for the next 4.5 pages telling you all about raised floors. It appears by that fact, and from several statements by the author sprinkled throughout the paper, that he has a strong preference for raised floor. Some of his statements about overhead infrastructure are just plain wrong, or easily mitigated. Perhaps he’s never even managed a solid floor facility? So much for a thorough analysis!

Given that I am involved in the management of two facilities, both designed at the same time, but one using raised floor and the other a solid floor with overhead infrastructure, I feel like I can present a more balanced viewpoint. I agree with most of what Mr. Alger says about raised floors, both their strengths and weaknesses. He neglects a few glaring issues with raised floors, and highlights a few of their annoyances quite well, such as tile/cabinet drift. What Alger fails to do is explore the benefits of a solid floor data center; therefore let me lay those out for you:

Floor Load
Alger is living in the past when he talks about “heavy” racks weighing 1500lbs. In today’s high-density reality, 1500lbs is a lightweight installation. The average installation we are seeing in our facilities today is 1800 lbs. We have several cabinets that exceed 3000lbs! I don’t see this trend changing any time soon. When people have 42RU to use, or to put it more bluntly, 42RU that they are paying for, they are going to stuff it with as much as they can. This is where a solid floor really shines above raised. Got a big, heavy load? Roll it on in and set it down wherever you please. No ramps to negotiate, no risk of tiles collapsing and your (very expensive) equipment falling down into a hole.

Stability
Steel reinforced concrete slabs don’t rattle, shake, shift, or break… at least under normal circumstances. If your data center is located in a geographic region known for what I like to call “geological entertainment” your data center is likely better off with a solid floor. You can solidly secure all your infrastructure to a solid concrete slab far better than to a raised floor. The stress, shaking, and shuddering of a seismic event can displace floor tiles. The last place I want to be in an earthquake is in a raised floor data center… tiles popping, racks swaying, and the whole floor structure wobbling around underfoot does not make for a confidence-filled rollercoaster ride. I’ve been inside a solid-floor facility in a 7.1 earthquake; the overhead ladder-rack and server racks all moved in unison, creating an eerie wave, but the floor remained solid throughout, much to my relief.

Calculations of point loads and rolling loads become irrelevant, except for maybe your UPS gear if you are off the ground floor of your building.

Fire Suppression
Fire suppression technologies in today’s data center focus on isolation of smaller zones and release of a clean agent to extinguish the fire in that area. If you have a raised floor you instantly double the number of zones you must monitor and deploy fire suppression systems into: the server spaces as well as the plenum spaces. Zone isolation is achieved through dampers in the air handling system and solid walls. These are trivial to build and secure in a solid floor facility. Air supply and return plenums and ductwork can have automatic dampers driven by the fire suppression system. Try that in a raised floor environment of any scale and it is prohibitively expensive and in some cases just flat out impossible. In the facilities I am involved with, the solid floor datacenter is protected by FM-200 and Ecaro-25 fire suppression systems throughout its entirety, whereas the raised floor data center’s fire suppression is limited to the UPS rooms.

Data center fires are unlikely, but the presence of suppression systems is a requirement for some users of data center facilities. If data centers are kept clean and dust-free, and combustible materials are kept out (almost impossible as the presence of servers is a guarantee of cardboard proliferation!), then the risk of fire is low, but it cannot be completely eliminated. The underfloor plenum spaces are a magnet for the collection of dirt, dust, loose change, and various bits of paper, cardboard, etc. I’ve never seen a raised floor plenum space that wasn’t dirty after a year or so of installation. How many of you have seen fire suppression extended to the plenum space under the floor? What good is it to deploy in one part of the data center and not another?

Cleanliness
The above point leads directly to this one. Data centers should be very clean environments. Solid floor facilities are much easier to maintain to a very high standard of cleanliness. Raised floors are not. Periodic removal of all tiles is required to clean the plenum spaces. This is not only a messy hassle; it also reduces the effectiveness of the cooling systems during the maintenance interval and exposes your cabling infrastructure to risk of damage. My car always needs washing, and my wife will tell you I’m a slob, BUT my data centers are clean enough to eat off of… but don’t even THINK of bringing food or drink into one of them! I can stand in my solid floor facility and visually scan for dirt and dust with the efficiency of The Terminator. Not so with a raised floor. Unless it was installed yesterday, all manner of dirt, dust, and debris lurks beneath every raised floor used in actual production. The raised floor advocates will try to deny this, but no raised floor will pass the repeated scrutiny of a white-glove test.

Raised floors also provide a false sense of order. If a single cable is out of place, or some rat’s nest of shameful cabling lies beneath… it is hidden. No difference to the casual observer. The CEO that tours through once a year may not know whether it is the one cable or the rat’s nest, but YOU will… and YOU are the one that has to manage it. Every production facility is under constant change management, and if things go unchecked for even a little while what started as a well-ordered cable plant can turn into a rat’s nest pretty fast. Tracing cables under floor tiles is one of the biggest pains in the posterior any data center manager has dealt with. I have found that with all the infrastructure in plain sight, keeping it in order is at least easier. There are no surprises lurking when everything is in plain sight.

Density and Growth
The reality of high-density computing is that the data center must be able to support far more cable, power, and servers-per-rack than ever before. The days of eight 4U servers, a patch panel and maybe a few bits of 1U network hardware in a rack are long gone. Today’s racks each need hundreds of cat-5 ports for multiple NICs, various storage connections, etc, room for forty-plus 1U servers, or maybe even a half-dozen blade chassis, and enough power to drive a Tesla Roadster from San Francisco to Seattle. If your raised floor was built even as recently as five years ago there likely just isn’t enough space in your plenum to handle that much cable anymore, at least not without seriously compromising your airflow. Once you build your raised floor, you are locked in to that design. You must peer far into the future and assume infrastructure needs way beyond what is expected today. With a solid floor and overhead infrastructure, you can keep adding network and power without any compromise to cooling or air flow.

At my two facilities, I work with both raised and solid floor data centers. The raised floor one has hit the limit of what it can power and cool, based on a seven-year-old design, but it still has empty spaces that will remain unused, forever. The solid floor facility is currently being expanded, while still remaining on-line and operational. It will soon be capable of more than double the watts-per-square-foot its original designers planned for in the year 2000. It’ll be able to pack every rack full to 42U. The cooling system, which originally consisted of giant air-diffusers up in a 15′ ceiling, is being modified with ductwork to concentrate cold air right in front of each rack, with hot-air return plenums being routed out of the hot aisles and back into the HVAC system on the roof. The ladder rack cable trays are not even at 20% of their capacity. This scenario is not possible with raised floor data centers, unless you can shut them down for a complete overhaul.

Access
Contrary to Mr. Alger’s claim, every solid floor data center I have worked in has had power and network terminations within reach of an average-sized human being, no stepladders required. In the current solid floor facility I manage, the ladder rack is substantial enough, and the ceiling high enough, to enable workers to walk on the structure itself. Ladders are only needed to ascend to it; once up, you can walk around the entire facility quite safely, nine feet off the floor. The only time one needs to go up there is to install new cabling, or access the HVAC ductwork, which is rare. Working beneath the floor tiles, by comparison, is a miserable chore.

Having worked in both environments over the years, I’m leaning towards avoiding raised floor in the future, and sticking with solid floor facilities. To me raised floor stands as an echo of older days, when “The Data center” contained a handful of mainframes, a minicomputer or two, and men with white shirts and pocket protectors loading tapes and sitting at terminals. An entirely raised floor design just does not effectively scale to the density needs of a modern facility. I have seen hybrid facilities with raised floor plenums used solely for cooling and overhead ladder rack for power and network delivery, and that seems like a good compromise to me. But the overall benefits of a solid floor have convinced me to never look back at raised floor except as nostalgia. I suspect that I am in the minority though, as so few people have had the opportunity to experience both options first-hand. Inertia has led people to only think of data centers in the context of raised floors.

Do you agree? Or do you think I’m wrong? Let me know in the comments.

Assessing the risks of human error in the data center

Choice quote from a report on the converging trends of data center builds and high-density computing:

“Respondents report that the majority of data center outages were caused by human error and improper failover.”

Risk assessment is part of the job of every data center operator. In reality it falls within the purview of every IT professional. Why? Because the more we depend upon technology, the more we miss it when it fails. Most entities today cannot function without their IT infrastructure. Information Technology at the desktop level is still quite immature and suffers from a myriad of constant failures, both great and small… most no doubt having to do with the dreaded PEBCAK errors.

In a data center environment, however, the user is physically removed from the equipment and cannot cause errors, right? Well, not really. The data still point to human error, and by extension failover issues in redundant systems. I’d be willing to bet that the failover problems were built in from the start by human error in either the design or implementation phases.

Human error is not a very sexy topic of discussion and rarely gets mentioned in the data center realm. Data centers are frequently painted as completely automated, “lights out” facilities where machines hum along in chilled rooms and those pesky error-prone humans are kept out by ever-present security systems such as man-traps, biometric scanners, and in some bizarre cases, armed guards.

This could not be further from the truth. It is an illusion conjured up by data center marketers. While the people per square meter ratios found in your typical office building won’t be found in a data center, human beings are a vital part of the picture and facilities must take into account their presence and their potential for error.

In my adventurous youth, I spent my free time pursuing thrills in high places. I was an alpinist. Alpinism has a strong literary streak, and part of every climber’s possessions is a collection of books: guidebooks, manuals, epic tales of adventures in remote ranges, and yearly journals of climbing organizations.

One such publication is the American Alpine Club’s Accidents in North American Mountaineering. It is a chronicle of human error, and an accounting of its cost. It is required reading for any alpinist as it is a list of things NOT to do while in the mountains. Yes, there are objective hazards in the mountains such as weather, avalanches, etcetera, but reading accident reports you quickly realize that the greatest risk to any climber’s life is their own misjudgment.

We are taught that human beings learn from their mistakes. Smart human beings learn from other people’s mistakes. That is the purpose of Accidents in North American Mountaineering: an accounting of human error, some of it fatal, for smart human beings to learn lessons from. This sort of objective post-event analysis is not limited to mountaineering either.

The aviation industry has an excellent record of publishing data about accidents, both major and minor. The military of course has an entire bureaucracy dedicated to gathering and analyzing data, both from history and recent combat. They even have a TLA for the process: AAR, for “After Action Report.”

There is no such widely read chronicle of human error for the data center operator to learn from. That is a shame. I know of significant outages that have happened at facilities, but have no idea of their true cause(s), or what could have been done to prevent them. You hear of them through the rumor mill, you can track them sometimes via blogs, but rarely are the exact causes and sequences of error fully reported. I understand why.

Data centers are supposed to be foolproof facilities dedicated to uptime. In reality, data center operators have to contend with two things: the laws of physics, specifically the Second Law of Thermodynamics, and the propensity for human beings to make errors. In other words, stuff breaks and nothing is foolproof. However, at least in the colocation business, the sales side of the industry quashes the honesty about outages for fear of lost revenue.

I’m looking at sending my facilities staff to AFCOM’s Datacenterworld Conference, but in looking over the schedule I see only one session concerning actual outage post-mortem analysis. Lots of marketing-speak about design and maintenance, but no obvious mistakes to learn from. I realize that as commercial enterprises, many data centers would rather not expose their flaws. Remember here though that the whole point of this exercise is not to assign blame, or to subject the unfortunate to ridicule. The object of this is to learn from the mistakes of others for the betterment of the ENTIRE industry.