Health checks

As mentioned several times, in a distributed application architecture with many parts, the failure of an individual component is highly likely; it is only a matter of time until it happens. For that reason, we run every component of the system redundantly. Proxy services then load balance the traffic across the individual instances of a service.

But this raises another problem: how does the proxy or router know whether a given service instance is available? The instance could have crashed, or it could be unresponsive. To solve this problem, we use so-called health checks. The proxy, or some other system service acting on behalf of the proxy, periodically polls all the service instances and checks their health. The questions are basically "Are you still there?" and "Are you healthy?" Each instance answers either Yes or No, or the health check times out if the instance is no longer responsive.
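
To make the three possible outcomes concrete, here is a minimal Python sketch of a single polling round. It is an illustration only, not how any particular proxy implements health checks. The names Health, check_health, and poll_all are hypothetical, and the sketch assumes that each instance is represented by a probe function that returns True or False and raises TimeoutError when no answer arrives in time.

```python
from enum import Enum
from typing import Callable, Dict


class Health(Enum):
    HEALTHY = "healthy"      # the instance answered Yes
    UNHEALTHY = "unhealthy"  # the instance answered No
    TIMED_OUT = "timed out"  # the instance did not answer in time


# Hypothetical probe type: asks one instance "are you healthy?".
# It returns True or False, and is assumed to raise TimeoutError
# if no answer arrives within the given number of seconds.
Probe = Callable[[float], bool]


def check_health(probe: Probe, timeout: float = 2.0) -> Health:
    """Run one health check and map the result to one of three outcomes."""
    try:
        answer = probe(timeout)
    except TimeoutError:
        return Health.TIMED_OUT
    return Health.HEALTHY if answer else Health.UNHEALTHY


def poll_all(probes: Dict[str, Probe], timeout: float = 2.0) -> Dict[str, Health]:
    """Check every instance once; a real system would repeat this periodically."""
    return {name: check_health(probe, timeout) for name, probe in probes.items()}
```

In a real deployment, the probe would typically be a network request to an endpoint that the service exposes for this purpose, and the polling round would be repeated at a fixed interval.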

If the component answers with No or a timeout occurs, then the system kills the corresponding instance and spins up a new instance in its place. If all this happens in a fully automated way, then we say that we have an auto-healing system in place.
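
The replacement step described above can be sketched as a single pass over all instances. This is a simplified illustration under stated assumptions, not the logic of any specific orchestrator. The function heal and its parameters is_healthy, kill, and start are hypothetical placeholders for whatever the platform provides to check, stop, and launch an instance.

```python
from typing import Callable, List


def heal(
    instances: List[str],
    is_healthy: Callable[[str], bool],
    kill: Callable[[str], None],
    start: Callable[[], str],
) -> List[str]:
    """Run one auto-healing pass and return the resulting list of instances.

    All three callables are hypothetical hooks supplied by the caller:
    is_healthy answers Yes (True) or No (False) and may raise TimeoutError,
    kill stops an instance, and start launches a new one and returns its name.
    """
    result = []
    for instance in instances:
        try:
            healthy = is_healthy(instance)
        except TimeoutError:
            healthy = False  # a timeout is treated the same as a No
        if healthy:
            result.append(instance)
        else:
            kill(instance)          # remove the failed instance
            result.append(start())  # spin up a replacement in its place
    return result
```

Running such a pass repeatedly, without human intervention, keeps the number of instances constant while failed ones are swapped out.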