Self-healing Systems Technology

McKinsey estimates that companies across all industries are already automating 50 to 70 percent of tasks, driving both operational and cost efficiencies. No wonder: intelligent process automation (IPA) promises to reduce complexity, replace manual processes, and improve both organizational performance and end-user experience. Companies that lag may soon find themselves in catch-up mode and at an increasing (and perhaps enduring) competitive disadvantage.


Organizations are tapping into Artificial Intelligence (AI) technologies to manage the service availability and quality of their IT operations. A preemptive, automated approach to service management built on AI technologies can create powerful self-healing systems. By reducing failures, self-healing systems help improve business value and user experience.

What are self-healing systems? A self-healing system can do the following without any human intervention:

  • discover that it is not operating correctly or that its performance may be at risk
  • make necessary adjustments to restore itself to the desired state
  • adapt proactively to changed conditions
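
The three capabilities above can be sketched as a single pass of a detect, restore, and adapt loop. This is only an illustrative sketch: the Service and SelfHealer classes, the replica counts, and the 200 ms latency limit are hypothetical stand-ins, not part of any real product or API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Service:
    """Hypothetical in-memory stand-in for a managed service."""
    name: str
    desired_replicas: int
    running_replicas: int
    latency_ms: float


@dataclass
class SelfHealer:
    """One pass of a detect -> restore -> adapt loop (illustrative only)."""
    latency_limit_ms: float = 200.0  # assumed performance threshold
    actions: List[str] = field(default_factory=list)

    def detect(self, svc: Service) -> List[str]:
        """Discover incorrect operation or performance at risk."""
        problems = []
        if svc.running_replicas < svc.desired_replicas:
            problems.append("replicas_missing")
        if svc.latency_ms > self.latency_limit_ms:
            problems.append("latency_at_risk")
        return problems

    def heal(self, svc: Service) -> None:
        """Apply a corrective action for every detected problem."""
        for problem in self.detect(svc):
            if problem == "replicas_missing":
                # Restore the service to its desired state.
                svc.running_replicas = svc.desired_replicas
                self.actions.append(f"{svc.name}: restarted missing replicas")
            elif problem == "latency_at_risk":
                # Adapt to changed conditions by raising the desired state.
                svc.desired_replicas += 1
                svc.running_replicas = svc.desired_replicas
                self.actions.append(f"{svc.name}: added one replica")


if __name__ == "__main__":
    checkout = Service("checkout", desired_replicas=3,
                       running_replicas=1, latency_ms=350.0)
    healer = SelfHealer()
    healer.heal(checkout)
    for action in healer.actions:
        print(action)
```

In a real environment the detect step would read from a monitoring system and the heal step would call an orchestrator; here both are simulated in memory so the control flow stays visible.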

Self-healing systems are part of the broader field of autonomic computing. Autonomic computing is a term used by IBM to describe the need to shift the burden of managing IT systems from IT professionals to the systems themselves. The term comes from the autonomic nervous system of the human body, the system that regulates your body’s basic functions without your conscious awareness. For instance, when you need to run to catch a train, you don’t need to consciously decide to produce adrenaline, reallocate oxygen to the muscles in your legs, and increase your heart rate. Those important and necessary physical adjustments are handled for you automatically. Similarly, autonomic systems handle more and more tasks on their own, without the need for intervention by IT staff. Autonomic computing behavior is necessary for building effective on-demand operating environments that adapt and adjust quickly to the changing computing needs of organizations.

IBM, for example, is working on an autonomic computing initiative that the company defines as providing products that are self-configuring, self-optimizing, and self-protecting – as well as self-healing. For all of these characteristics together, IBM uses the term “self-managing.”

Why is this important? Hourly downtime costs continue to increase for all businesses, irrespective of size or vertical market, a trend that has been evident over the last five to seven years. ITIC’s 2019 Global Server Hardware, Server OS Reliability Survey, which polled over 1,000 businesses worldwide from November 2018 through January 2019, found that a single hour of downtime now costs 98% of firms at least $100,000. In addition, 86% of businesses say that one hour of downtime costs $300,000 or more, up from 76% in 2014 and 81% in 2018. The same study indicates that one in three organizations – 34% – say the cost of a single hour of downtime can reach $1 million to over $5 million. These figures exclude any litigation, fines, or civil or criminal penalties that may subsequently arise from lawsuits or regulatory non-compliance.

These figures are not surprising, as more and more services (online or offline) depend on technology. Organizations are leveraging data to make real-time decisions and push contextual, real-time offers to customers. Customers, for their part, expect instant fulfillment of services. Even a few milliseconds of latency may affect revenue and customer experience, so any downtime will not only increase operational costs but also hurt revenue.

Reactive self-healing systems trigger automated corrective actions once they recognize a failure. They still outperform human-dependent corrective actions in terms of mean time to repair (MTTR).

Preventive self-healing systems recognize that something might go wrong and trigger automated actions to prevent failures. They help increase mean time between failures (MTBF).
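
To see how MTTR and MTBF translate into uptime, the standard steady-state relation availability = MTBF / (MTBF + MTTR) can be worked through in a few lines. The sketch below uses purely hypothetical numbers (a 500-hour MTBF and a 2-hour MTTR as the baseline); they are assumptions for illustration, not figures from any survey cited here.

```python
HOURS_PER_YEAR = 24 * 365


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_hours_per_year(mtbf_hours: float, mttr_hours: float) -> float:
    """Expected yearly downtime implied by the availability figure."""
    return (1.0 - availability(mtbf_hours, mttr_hours)) * HOURS_PER_YEAR


if __name__ == "__main__":
    # All scenario values below are hypothetical.
    scenarios = {
        "baseline (manual repair)": (500.0, 2.0),
        "reactive (shorter MTTR)": (500.0, 0.2),
        "preventive (longer MTBF)": (2000.0, 2.0),
    }
    for label, (mtbf, mttr) in scenarios.items():
        print(f"{label}: availability={availability(mtbf, mttr):.5f}, "
              f"downtime={downtime_hours_per_year(mtbf, mttr):.1f} h/year")
```

Under these assumed numbers, both approaches reduce expected downtime relative to the baseline: the reactive case by shrinking the repair term, the preventive case by stretching the interval between failures.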

There are multiple levels of self-healing capability: (a) application-level self-healing focuses on building fault-tolerant apps, (b) overall system-level self-healing focuses on tbd, and (c) hardware-level self-healing focuses on tbd, though it is becoming less relevant with the increased use of cloud services.

What are the essential elements of a self-healing system? Developing a self-healing environment for IT operations requires flexibility in infrastructure, specialized tools, and AI algorithms.

Infrastructure should provide the flexibility to deploy services on any server that meets their requirements, without affecting other production services. Automatic scaling, both up and down, is another important aspect.
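
A minimal sketch of an automatic scaling decision is shown below. It uses a proportional rule (scale the replica count by the ratio of observed to target utilization, then clamp), which is similar in spirit to the rule documented for the Kubernetes Horizontal Pod Autoscaler. The function name, the percentage inputs, and the default limits of 1 and 10 replicas are assumptions chosen for illustration.

```python
import math


def desired_replicas(current: int, observed_util_pct: float,
                     target_util_pct: float,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    """Return the replica count that would bring utilization near the target.

    Scales up when observed utilization exceeds the target and scales
    down when it falls below, always staying within the given bounds.
    """
    if current < 1 or target_util_pct <= 0:
        raise ValueError("current must be >= 1 and target utilization > 0")
    raw = math.ceil(current * observed_util_pct / target_util_pct)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # Hypothetical readings: 4 replicas with a 60% CPU target.
    print(desired_replicas(4, observed_util_pct=90, target_util_pct=60))
    print(desired_replicas(4, observed_util_pct=30, target_util_pct=60))
```

With 4 replicas and a 60% target, a 90% reading yields 6 replicas and a 30% reading yields 2. Production autoscalers usually add a tolerance band and a cool-down period so that small fluctuations do not cause constant rescaling; both are omitted here for brevity.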

Specialized tools for:

  • Cluster orchestration (e.g., Apache Mesos, Kubernetes, Docker Swarm)
  • Real-time registry (e.g., Apache ZooKeeper, etcd, Consul)
  • Historical registry (e.g., Elasticsearch)
  • Monitoring (e.g., Nagios, Icinga, Consul, Kibana)
  • General orchestrator (e.g., Jenkins)

IT service management (ITSM) solution providers such as IPsoft and ServiceNow are embedding more and more automation and AI capabilities into their solutions.

What are the use cases? Some of the use cases include:

  • Switching to redundant networks automatically without any human intervention or trigger
  • Allocating computing and storage resources proactively by looking at usage levels and historical trends so that services continue to perform at the normal level
  • Companies like IBM are researching AI algorithms that can proactively monitor networks, predict a network failure or performance issue, and fix it automatically. This will be especially important for “mission-critical” applications, which could be at risk of going down for as many as four hours until IT staff can fix the issue, said Ruchir Puri, chief scientist at IBM Research.
  • Adobe Inc. uses an AI-based program, developed using open-source technology, to automate about 25 core IT tasks that were previously done by employees. One example is fixing failures in data batching: the software reduced the average time to correct a data-batching failure from 30 minutes to about 3 minutes. The software can also detect whether a specific business application an employee is using is close to crashing and automatically increase the computing or storage capacity so that the application continues to run.
  • Hitachi Vantara launched a project that leverages AI, real-time analytics, and sensors to monitor, analyze, and self-correct temperature and airflow in data centers. The initiative has saved 38% in annual data center costs, on average, while boosting storage capacities.
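
The proactive resource allocation use case above (acting on usage levels and historical trends) can be illustrated with a simple trend forecast. The sketch below fits a least-squares line to evenly spaced usage samples and estimates how many sampling intervals remain before a capacity threshold is reached. It is a deliberately simple stand-in for the AI models mentioned in the use cases, and the function names, the sample values, and the lead-time parameter are all hypothetical.

```python
import math
from typing import Optional, Sequence, Tuple


def fit_trend(samples: Sequence[float]) -> Tuple[float, float]:
    """Least-squares line through (0, y0), (1, y1), ...

    Returns (slope, intercept).
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def steps_until_threshold(samples: Sequence[float],
                          threshold: float) -> Optional[int]:
    """Intervals until the trend line reaches the threshold.

    Returns None when usage is flat or declining, and 0 when the
    trend line has already reached the threshold.
    """
    slope, intercept = fit_trend(samples)
    if slope <= 0:
        return None
    last_x = len(samples) - 1
    crossing_x = (threshold - intercept) / slope
    return max(0, math.ceil(crossing_x - last_x))


def should_provision(samples: Sequence[float], threshold: float,
                     lead_time_steps: int) -> bool:
    """Provision early if the threshold is expected within the lead time."""
    steps = steps_until_threshold(samples, threshold)
    return steps is not None and steps <= lead_time_steps


if __name__ == "__main__":
    disk_usage_pct = [50, 55, 60, 65, 70]  # hypothetical daily samples
    print(steps_until_threshold(disk_usage_pct, threshold=90))
    print(should_provision(disk_usage_pct, threshold=90, lead_time_steps=5))
```

For the sample series above, usage grows by 5 points per interval, so the 90% threshold is projected 4 intervals ahead and provisioning is triggered when the lead time is 5 intervals. A straight line ignores seasonality and sudden bursts, which is one reason real systems tend to rely on richer forecasting models.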