"One cloud provider is reportedly mortified after suffering a data center service outage caused by the mistake of a single administrator.
IDG News Service stated that cloud vendor Joyent noticed an issue with service availability on May 27 and immediately began working to identify the cause and correct it. After a quick investigation, data center operators discovered the source of the problem: One administrator rebooted all of the facility's virtual servers synchronously, thereby causing service to be temporarily unavailable.
Joyent employees were able to repair the issue, bringing all compute nodes back to working order within an hour of first discovering the outage. However, the company is understandably embarrassed that the service disruption happened in this manner.
""It should go without saying that we're mortified by this,"" said Joyent CTO Bryan Cantrill. ""While the immediate cause was operator error, there are broader systemic issues that allowed a fat finger to take down a data center.""
The organization is currently gathering as many details about the outage as possible - including how many customers were affected, what exactly caused the downed services and how it will prevent a recurrence - and will publish a full report. However, IDG News Service contributor Mikael Ricknäs agrees with Cantrill that this type of event should not have occurred in the first place.
""[A]n error of this magnitude shouldn't be allowed to happen, and highlights the importance of processes that balance the need for effective management and protecting users against these kinds of issues,"" RicknÃ¤s wrote.
The Cost of an Outage: Sears Sues After Service Disruption
While Joyent is still investigating, there will undoubtedly be costs incurred as a result of the outage. Such was the case with Sears, which filed a lawsuit after two service disruptions during the summer of 2013.
According to Crain's, the two outages cost the retailer more than $2 million in lost profits, as its e-commerce platform was unavailable to customers for a considerable time. Furthermore, the outage - caused by failures in the uninterruptible power supply system - cost the company approximately $2.8 million to repair. The big box company reportedly blames its data center maintenance provider and power-supply manufacturer, both of which were named in the lawsuit.
Top Causes of Data Center Outages and Mitigation Strategies
These cases are real-world examples of two of the three top causes of data center outages, according to industry research. The Ponemon Institute found that the main causes of service disruptions within data centers include:
UPS failures or exceeded capacity
Human error
Cyberattacks, including DDoS attacks
The Ponemon Institute's 2013 Study on Data Center Outages showed that 52 percent of data center operators thought the majority or all of unplanned downtime could have been avoided. Although these issues happen more often than others, Eric Hanselman, 451 Research chief analyst, told Network Computing that they are often overlooked. However, there are strategies data center organizations can leverage to mitigate their risk of outages stemming from these problems.
The study found that one of the most-utilized approaches for preventing downtime is making improvements to equipment. In fact, 49 percent of operators make component upgrades, and Hanselman agreed that this is a beneficial tactic, especially to prevent UPS failure. Investing in UPS redundancy is also a good idea, as the cost of including backup systems is often less than the cost of an outage.
When it comes to human errors, such as that which occurred in the Joyent outage, it is best to take the human interaction component out altogether through the use of automation. This ensures that systems routinely follow the correct processes without the need for an operator to tell them to do so.
""No organization out there should be manually changing anything more than a very small handful of what its infrastructure consists of,"" Hanselman said. ""Routine tasks, bringing up systems, and configuring and managing them, should all be automated.""
The study also found that 28 percent of operators leverage boosted security and monitoring practices to prevent outages, an advantageous technique for fighting off DDoS attacks. Hanselman recommended focusing on ensuring that the customer experience is protected with these efforts.
Through the use of redundant backup systems, data center automation techniques and improved security practices, data center operators can avoid the frustrations and embarrassment connected with an outage, all the while keeping processes humming along and their customers happy.