On February 7, 2019, the data center industry experienced a major downtime event. Based on press reports, a fire suppression system at one of our nation’s largest banks was activated during utility work. As a result, customers were unable to access ATMs or their online and mobile banking accounts. But data center downtime is preventable.
Wells Fargo said: “We’re experiencing system issues due to a power shutdown at one of our facilities, initiated after smoke was detected following routine maintenance. We’re working to restore services as soon as possible. We apologize for the inconvenience.”
Fire suppression system issues have taken data centers out before – including an ING Bank facility in Romania, where the noise of escaping gas was loud enough to damage spinning hard drives.
https://www.datacenterdynamics.com/news/widespread-wells-fargo-outage-blamed-data-center-fire/
Prevent Downtime As Megawatt Demand Skyrockets
As the demand for megawatts stacks up, exposure to human error is on the rise. How can the data center industry avoid these expensive and disruptive failures and establish systems to prevent human error?
The good news is that other high-risk industries have already successfully developed techniques to significantly reduce and prevent human error like this.
Take the airline industry, for example. The graph below shows a significant decline in fatalities despite a sharply increasing number of yearly flights.
What is their secret?
WHEN KEEPING THE BUSES POWERED ISN’T OPTIONAL
Jet airliners and data centers both have to keep the buses powered while providing service to customers. Both are highly complex machines with redundant systems and increasing amounts of automation to help mitigate risk.
Q: However, neither industry is close to operating and maintaining these complex systems without human interaction. Airlines are very mature in minimizing the risk of human error in the loop. How is this accomplished?
THREAT AND ERROR MANAGEMENT: BRING IT WHERE YOU ARE
A: The current methodology is called Threat and Error Management (TEM). In simple terms, TEM works by placing valid barriers between human interaction and the uncorrected errors that lead to an undesired end state or incident.
The following are several of the barriers against human error that should be present in complex systems:
- Training. Documented, accurate and provable. Prepare teams for all tasks that must be completed. Recurring training must be ranked, using a methodology that ensures teams remain current to accomplish all reasonably foreseeable tasks.
- Proper delivery of procedures (MOPs, EOPs, SOPs, etc.), ensuring that assigned tasks can be completed without error. Data collected should be verifiable, and procedures should include embedded pictures and instructional videos for complex and low-frequency, high-consequence actions.
- Automation beyond automatic switching to mitigate loss (automated systems are often isolated to allow maintenance). The digital procedures system should automatically prevent missing or out-of-order completion of steps.
- External resources. When an airliner is airborne, crews can reach out to additional experts when required to establish an acceptable safety margin. Data centers could also benefit from making such experts available to teams under time pressure. For example, a system just went to UPS and the backup source did not pick up the load, so time is of the essence. X-Company’s Enterprise Edition Glass can provide a first-person view to a remote team to help solve problems before they turn into costly failures.
- Expert operational experience. The last barrier to error is the team involved. In some instances, the barriers listed above may not resolve the problem, and the operations team must improvise to mitigate unanticipated anomalies. While this last barrier is valid and very important, it should not be the sole barrier. Any system that relies solely on humans not making errors will fail, and often catastrophically so.
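To make the automation barrier above more concrete, here is a minimal Python sketch of how a digital procedure could refuse missing or out-of-order steps. It is an illustration only, not a description of any particular product: the `Procedure` class, the `StepOrderError` exception, and the sample step names are all hypothetical.

```python
from dataclasses import dataclass, field


class StepOrderError(Exception):
    """Raised when a step is attempted out of order or a procedure is closed early."""


@dataclass
class Procedure:
    """Hypothetical digital procedure (e.g. a MOP) with strictly ordered steps."""
    name: str
    steps: list
    completed: list = field(default_factory=list)

    def next_step(self):
        """Return the next required step, or None when every step is done."""
        if len(self.completed) < len(self.steps):
            return self.steps[len(self.completed)]
        return None

    def complete(self, step):
        """Record a step only if it is the next one in sequence."""
        expected = self.next_step()
        if expected is None:
            raise StepOrderError(f"{self.name}: all steps are already complete")
        if step != expected:
            raise StepOrderError(
                f"{self.name}: expected {expected!r}, got {step!r}"
            )
        self.completed.append(step)

    def close(self):
        """Allow sign-off only when no steps are missing."""
        remaining = self.steps[len(self.completed):]
        if remaining:
            raise StepOrderError(f"{self.name}: missing steps {remaining}")
        return True


# Illustrative step names only (assumed, not taken from a real procedure).
mop = Procedure(
    name="UPS maintenance bypass",
    steps=["verify load on utility", "close bypass breaker", "open UPS output breaker"],
)
mop.complete("verify load on utility")
try:
    mop.complete("open UPS output breaker")  # skips a step, so it is rejected
except StepOrderError as err:
    rejected_reason = str(err)
```

The point of the design is that the technician cannot record a step or sign off the procedure unless the sequence is intact, so the check does not depend on anyone's memory.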
Resistance to change is natural
Humans have a natural resistance to change and often believe they are trained well enough or experienced enough to never make errors.
Leadership must create an environment where constant improvement and professionalism are valued, and change is accepted if it improves operations.
The ICɅRUS Ops digital checklist and provable compliance system uses the very best techniques, mobile devices, software and wearables to help you mitigate the expensive risks associated with human error. Combined with our expert LMS, your team will have an invaluable tool at their disposal to prevent human error before it occurs.
Please give us a call or drop an email to info@icarusops.com to set up a demonstration.