Self-Healing Software Systems

A software system that can detect and correct its own failures without human intervention is called a self-healing software system. Such a system is highly dependable and fault-tolerant, which improves quality, reduces cost, and bolsters customer trust. A self-healing system continuously monitors for deviations from expected behavior and restores itself to normal operation whenever one is observed.

INTRODUCTION

No matter how carefully we design and build a software system, unforeseen bugs and failures will surface once it is deployed in production. As a software architect or system designer, you control how the system responds to these inevitable failures, and you can design it to respond quickly and efficiently. One design option for building resilience into the system is self-healing: the ability to recover from failures on its own.

It must be noted, however, that recovering from a failure is often not enough, and human intervention is still required. Even after the system has been restored to its normal state, the cause of the failure still needs to be investigated and the underlying bug fixed. For this reason, self-healing usually goes hand in hand with investigating the causes of the failure.

COMPONENTS

A self-healing software system typically includes the following two components:

1. A Monitoring Component proactively and continuously checks the system for any anomaly or deviation from expected functionality. A few examples include using logging, time-to-live (TTL) checks, or ping to monitor servers, the network, CPU, hardware, application performance, thrown exceptions, and processes terminated by the OS.

2. A Restore Component takes the actions needed to return the system to normal operation. A few examples include retry, reboot, fault masking, roll-back, graceful degradation, configuration changes, and switching to redundant hardware. A restore action may be reactive (taken after a failure has been detected) or proactive (taken when a failure is predicted, before it happens).
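The two components above can be sketched as a pair of callbacks: one that reports whether something is healthy, and one that restores it. This is only a minimal illustration in Python; the names `HealthCheck`, `is_healthy`, `restore`, and `run_once` are made up for this example and do not come from any particular framework.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class HealthCheck:
    """Pairs a monitoring probe with the restore action for one resource (hypothetical type)."""
    name: str
    is_healthy: Callable[[], bool]  # monitoring component: False means a deviation was seen
    restore: Callable[[], None]     # restore component: action that brings the resource back


def run_once(checks: List[HealthCheck]) -> List[str]:
    """Run every probe once and restore whatever is unhealthy.

    Returns the names of the checks that needed a restore action.
    """
    restored = []
    for check in checks:
        if not check.is_healthy():
            check.restore()
            restored.append(check.name)
    return restored
```

In a real system `run_once` would be called on a schedule, and each restore action would typically also log or notify, since the root cause still has to be investigated by a human.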

BENEFITS

Self-healing capabilities in a software system help save much of the cost and time required to fix a failure in production. This has become a critical research area for large IT companies, where system downtime is costly for the business and results in lost customers and reputation. Big companies like Google, IBM and Facebook are investing heavily in this domain.

EXAMPLES

A monitoring component logs exceptions that result in an application crash, and a restoring component restarts the application while also notifying the developers so they can fix the bug.

A monitoring component logs a null pointer exception that results in an application crash, and a restoring component wraps the buggy code in a null pointer check, then builds, deploys, and restarts the application.

A monitoring component treats it as a deviation from normal operation if the server's CPU usage remains above 80% for 2 minutes. It then triggers the restoring component, which routes further requests to a redundant server.
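The "above 80% for 2 minutes" rule is a sustained-threshold check: a single spike should not trigger failover, only an uninterrupted run of high readings. A rough sketch in Python is shown below. The class `SustainedThreshold` and the helper `pick_server` are hypothetical names, and the CPU readings are assumed to be supplied by whatever metrics source the system already has.

```python
from typing import Optional


class SustainedThreshold:
    """Reports a deviation only after the value stays above `limit` for `hold_seconds`."""

    def __init__(self, limit: float = 80.0, hold_seconds: float = 120.0):
        self.limit = limit
        self.hold_seconds = hold_seconds
        self._above_since: Optional[float] = None  # time of the first high reading in the current run

    def observe(self, value: float, now: float) -> bool:
        """Record one reading taken at time `now` (in seconds); True means a sustained breach."""
        if value <= self.limit:
            self._above_since = None  # the run of high readings is broken
            return False
        if self._above_since is None:
            self._above_since = now
        return now - self._above_since >= self.hold_seconds


def pick_server(overloaded: bool, primary: str, standby: str) -> str:
    """Restore action: send new requests to the standby while the primary is overloaded."""
    return standby if overloaded else primary
```

Passing the timestamp in explicitly, instead of reading the clock inside `observe`, keeps the detector easy to test with simulated readings.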

A monitoring component detects that a network or database connection failed to establish, and the restoring component retries the connection.
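A retry-based restore action is often written with a growing delay between attempts so that a struggling server is not flooded with reconnects. The following Python sketch assumes the caller supplies a `connect` function that raises `ConnectionError` on failure; the function name `retry` and its parameters are illustrative, not taken from a specific library.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(connect: Callable[[], T], attempts: int = 5, base_delay: float = 0.5,
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `connect` until it succeeds, doubling the wait after each failure.

    Re-raises the last ConnectionError once `attempts` calls have failed.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: let the failure surface to a human
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    raise RuntimeError("unreachable")
```

Giving up after a bounded number of attempts matters: an endless retry loop would hide the failure instead of escalating it.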

A monitoring component finds that a process has been terminated by the operating system and triggers the restoring component to restart the process, while also notifying the developers so they can analyze the cause of the failure.
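A process-restart scenario like this one could be approximated with a small supervisor built on Python's standard `subprocess` module. In the sketch below, `supervise`, `max_restarts`, and `notify` are hypothetical names chosen for illustration, and `CRASHING_CMD` is just a stand-in command that always exits with an error.

```python
import subprocess
import sys
from typing import Callable, List

# Stand-in for a real service: a Python child process that always exits with code 1.
CRASHING_CMD = [sys.executable, "-c", "raise SystemExit(1)"]


def supervise(cmd: List[str], max_restarts: int = 3,
              notify: Callable[[str], None] = print) -> int:
    """Run `cmd` and restart it whenever it exits abnormally.

    Stops after a clean exit or once `max_restarts` restarts have been used.
    Returns the number of restarts performed.
    """
    restarts = 0
    while True:
        code = subprocess.Popen(cmd).wait()  # monitoring: block until the process ends
        if code == 0:
            return restarts                  # clean exit, nothing to heal
        notify(f"process exited with code {code}")  # tell the developers every time
        if restarts >= max_restarts:
            return restarts                  # give up; a human has to step in
        restarts += 1                        # restoring: loop around and start it again
```

For example, `supervise(CRASHING_CMD, max_restarts=2)` starts the command three times in total and reports each failure through `notify`. The restart cap keeps a process that crashes immediately on startup from being restarted forever.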

A monitoring component observes that memory usage has reached a critical point of 90% and triggers the restoring component to scale or restart the service that is consuming the memory, while notifying the developers so they can fix the underlying issue.

CONCLUSION

Software applications run on a wide range of hardware, from mobile devices to cloud-based clusters. Users of cloud-based applications, such as Google Translate, expect short response times and 100% availability. These expectations cannot be met by older software architectures. Resilience is one of the core design principles of modern software architectures: systems are built to be significantly more tolerant of failure, and when failure does occur, they handle it gracefully rather than catastrophically. Popular cloud computing providers like Amazon, Oracle, IBM, Google and Microsoft are investing heavily in self-healing and self-managing technologies to make their services more efficient.

MALIK UMER BLOG
