Big Enough to Fail? Toward Smarter Data Center Infrastructure Management

Mike Allen
May 30, 2014

One of the unfortunate realities of data centers is that failures can happen. Asset or infrastructure failures may not be frequent, irreparable or outside of the realm of prediction, but they do occur. Environmental risks, cyberthreats and human error can all cause problems. The sheer scale of data center operations, combined with the many external factors that can interrupt a facility, makes management an incredibly complex process.

Are data centers big enough to fail? They’re certainly getting bigger – recent MarketsandMarkets projections for the mega data center market predict that it will rise to $20.55 billion by 2019. The causes of data center service interruptions are varied, ranging from those spurred by the uncontrollable forces of nature to downtime induced by critical oversights on the design or management side.

Data Center, Interrupted


The prevalence of unplanned outages is exacerbated by the scope of the possible root causes. A September 2013 report by the Ponemon Institute found that almost all of the 584 individuals working in data center management positions for U.S. companies had experienced an unplanned facility outage in the previous 24 months, averaging two data center shutdowns and 91 minutes of unexpected downtime over that span. The top root causes (counting instances in which a single outage had more than one root cause) included:

  1. Uninterruptible power supply (UPS) battery failure (55 percent)
  2. Human error (48 percent)
  3. UPS capacity exceeded (46 percent)
  4. Cyberattack (34 percent)
  5. IT equipment failure (33 percent)
  6. Water incursion (32 percent)
  7. Weather-related (30 percent)

Among the less frequent root causes of unplanned outages were heat-related or computer room air conditioning equipment failure, UPS equipment failure, and hardware that was improperly deployed during a move or reconfiguration.
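The percentages in the list add up to far more than 100, which is what you would expect when one outage can be attributed to several root causes. A quick check, using only the figures quoted above (the variable names are ours, not Ponemon's):

```python
# Share of respondents citing each root cause, as quoted in the list above.
root_causes = {
    "UPS battery failure": 55,
    "Human error": 48,
    "UPS capacity exceeded": 46,
    "Cyberattack": 34,
    "IT equipment failure": 33,
    "Water incursion": 32,
    "Weather-related": 30,
}

total = sum(root_causes.values())

# A total above 100 means the categories overlap: these are not slices of one pie.
overlapping = total > 100

print(f"Sum of reported percentages: {total}")
print(f"Categories overlap: {overlapping}")
```

Read each figure as "this share of respondents cited this cause," not as a portion of a single whole.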

Electricity Outages: A Power Ballad


Power outages continue to be a leading source of unplanned downtime, consternation and unclear resource provisioning decisions. While completely redundant power infrastructure with two separate power sources appears to represent a failsafe solution, it’s also expensive and depends on the availability of local power sources. That dependence shapes where facilities get built: the Texas Interconnect power grid, the only state-controlled power grid in the country, is one major factor drawing data center companies to facilities in Dallas and Austin.

Power outage problems continue to stymie many data center operators’ efforts to achieve 100 percent availability in their facilities. They may also be on the rise – in the U.K., for example, the number of power outages doubled between 2012 and 2013. According to power management firm Eaton, there were 505 power outages in 2013, with equipment failures and human error leading to the highest number of incidents. The connection between the power grid and the data center, while necessary, is a constant reminder of the fact that data centers cannot exist apart from their environment, and may never achieve factory-quality levels of control and outage prevention.

Data Center Companies to the Rescue


The obvious response to hearing that something is broken is to fix it. This is especially true in data centers, where getting facilities and assets back online is the No. 1 priority. But hastening to repair every equipment failure could be the wrong approach, wrote TechTarget contributor Clive Longbottom. Instead of reacting to each outage and each piece of faulty equipment as it comes, operators could establish a strategy that treats these failures as part of regular data center operation, which may prove a more productive, cost-saving, long-term solution.

“The next generation of data centers could use a fragile IT methodology – knowing that equipment will fail but architecting to accept that failure with minimal human intervention,” Longbottom wrote. “While an N+M (primary plus multiples for backup) architecture requires someone to replace or repair the failed equipment, fragile IT leaves failed equipment in place and ensures it doesn’t draw power or drain the overall system performance.”
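To make the idea concrete, here is a minimal sketch of the behavior Longbottom describes: a failed unit stays in the rack but is fenced off so it draws no power, and the remaining units absorb its share of the work. The `Node` and `Pool` classes, the node names and the load figures are all hypothetical, chosen for illustration rather than taken from his article or from any real management tool.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    healthy: bool = True
    powered: bool = True


@dataclass
class Pool:
    nodes: list = field(default_factory=list)

    def fail(self, name: str) -> None:
        """Fence off a failed node: leave it in place, but cut its power."""
        for node in self.nodes:
            if node.name == name:
                node.healthy = False
                node.powered = False

    def healthy_nodes(self) -> list:
        return [n for n in self.nodes if n.healthy]

    def share_per_node(self, total_load: float) -> float:
        """Spread the load evenly over whatever capacity is still healthy."""
        survivors = self.healthy_nodes()
        if not survivors:
            raise RuntimeError("no healthy nodes left in the pool")
        return total_load / len(survivors)


# Hypothetical pool of four identical nodes carrying 300 units of work.
pool = Pool([Node(f"node-{i}") for i in range(4)])
before = pool.share_per_node(300.0)

pool.fail("node-2")  # no technician dispatched; the node is simply fenced off
after = pool.share_per_node(300.0)

print(f"Load per node before failure: {before}")
print(f"Load per node after failure:  {after}")
```

The point of the sketch is the trade-off: nobody has to touch the failed node, but every survivor works harder, so the pool has to be sized with that headroom from the start.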

The Advantages of Going Modular


Adopting a modular approach to some data center components is one way Longbottom recommended data center operators could change their strategies. Advances over the past few years have made it possible to incorporate modular technology to varying degrees. An organization can easily adopt a hybrid model, in which it outfits more traditional data center facilities with modular components. This technique would make it easier to build and operate data centers that can survive an isolated equipment failure.

As data centers take on more servers, network switches and other components to keep pace with growing workloads, they will face steeply rising lifecycle management and replacement costs. Factor in failure rates and it becomes clear that the legacy method is unsustainable. A modular strategy relies on seamless replication and rolling lifecycles, Longbottom wrote, two elements that reduce procurement needs and extend the functional life of equipment. It also uses containerization to optimize heating, cooling and power usage, which staves off the environmental degradation that can cut an asset’s life short.

Can the Hybrid Cloud Help?


Another way to make more use of the concepts behind fragile IT is to eliminate some of the hardware and network assets that are frequent points of failure. The hybrid cloud and several related network management strategies can help rein in the scale of data center infrastructure, provide better management and introduce more effective workaround processes.

In a recent piece for InformationWeek, contributor Ziv Kedem wrote that the hybrid cloud can offer benefits for data center infrastructure management and functionality. The hybrid cloud can more seamlessly correct workload inequality, which can be a problem for cloud data centers. The move away from “one-size-fits-all” approaches enables companies to apply access, networking, infrastructure management and security protocols that better fit the organization’s profile.

“Enterprises will be able to move production applications between clouds without incurring downtime, and without changing configurations between sites,” Kedem stated. “The hybrid cloud will be able to protect and recover applications without the need to purchase storage from the same manufacturer for both production and recovery sites.”

By understanding how data center outages occur and making network and management changes that target their trickle-down effects, organizations can better correct the issues that stem from such massive facilities.


