Azure's Project Tardigrade Looks To Improve Reliability And Resiliency

For the unfamiliar, tardigrades are some of the most resilient creatures on earth (and possibly the moon). The eight-legged micro-animals have ancestors dating back as far as 530 million years. They have survived all five mass extinction events, can survive temperatures above 150 degrees Celsius, and can go without food or water for thirty years.

Microsoft’s reference to the tardigrade signals its goal for Azure: to survive a range of non-ideal conditions. Over the past year, the platform has experienced several large-scale outages. Project Tardigrade is a concerted effort focused on Virtual Machines (VMs), aimed at safeguarding them in the event of platform failures.

“Project Tardigrade is a broad platform resiliency initiative which employs numerous mitigation strategies with the purpose of ensuring your VMs are not impacted due to any unanticipated host behavior,” explained Azure CTO Mark Russinovich. “This includes enabling components to self-heal and quickly recover from potential failures to prevent impact to your workloads. Even in the rare cases of critical host faults, our priority is to preserve and protect your VMs from these spontaneous events to allow your workloads to run seamlessly.”

Tardigrade’s recovery workflow first recycles all services running on the host. If that doesn’t work, a diagnostics service collects logs to aid in diagnosing the root cause. The host OS is then reset to a healthy state. During the reset, the state of each VM is preserved in RAM and applications inside the VMs freeze only temporarily, after which the host returns to full functionality.

Microsoft is currently using Tardigrade to “catch and quickly recover from potential software host failures in the Azure fleet”. It wants to expand the project to more failure scenarios while exploring the use of machine learning to predict more types of host failures.
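For readers who think in code, the recovery workflow described above can be sketched as a short escalation sequence. This is only an illustration of the order of steps as Microsoft describes them, not Azure's implementation: the `Host` class, the `recover` function, the `recycle_services` callback, and the event names are all hypothetical, and the "reset" is simulated by copying a dictionary.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Host:
    """Hypothetical stand-in for a physical host; not an Azure API."""
    vm_memory: Dict[str, str]            # VM name -> state held in RAM
    healthy: bool = False
    events: List[str] = field(default_factory=list)


def recover(host: Host, recycle_services: Callable[[Host], bool]) -> List[str]:
    """Walk the escalation steps in the order the article describes."""
    # Step 1: recycle all services running on the host.
    host.events.append("recycle_services")
    if recycle_services(host):
        host.healthy = True
        return host.events

    # Step 2: recycling did not help, so collect logs for root-cause analysis.
    host.events.append("collect_diagnostics")

    # Step 3: reset the host OS while keeping VM state in RAM.
    # Applications inside the VMs are frozen only for the duration of the reset.
    preserved = dict(host.vm_memory)     # simulated: state survives the reset
    host.events.append("freeze_vms")
    host.events.append("reset_host_os")
    host.vm_memory = preserved
    host.events.append("resume_vms")
    host.healthy = True
    return host.events


if __name__ == "__main__":
    host = Host(vm_memory={"vm1": "running"})
    # Pretend that recycling services fails, forcing the full reset path.
    print(recover(host, lambda h: False))
    print(host.vm_memory)
```

The point of the sketch is the ordering: the cheapest mitigation runs first, diagnostics are captured before the reset so the evidence is available afterwards, and VM state is carried across the reset rather than rebuilt.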
