Application life-cycle¶

Factor 9 of the 12 factor app states that apps must be disposable and start quickly. This means that apps must perform a graceful shutdown in a way that allows the hosting platform to make the action invisible to the client. Furthermore, when running on a PaaS, an app instance must be able to signal to its hosting platform that it is in a healthy state and ready to take traffic.

This doc will explain at a high level the actions which must be taken when engineering an app to run on Cloud Foundry, such that a user or client will not notice an interruption in service during app push, scale, restart and platform upgrades.

What happens when an app starts¶

The start process applies to the following events:

cf push - new instances of a new version of the app are started to replace the old version
cf scale - new instances of the app are started
cf restart (rolling strategy) - all instances of the app are replaced
Platform upgrade - Diego cells follow a rolling replacement

At a very high level the app start process is as follows:

The scheduler decides an instance is needed and instructs the system to start a container
Container starts on a Diego cell (worker)
The Diego cell starts a health check process
Application health check passes
The Gorouter is instructed to add the app into rotation

Application health checks¶

A health check that only returns true when the app is ready to serve traffic is critical to ensure that adding a container does not cause a client to receive an HTTP error.

Cloud Foundry supports three types of health check:

http - an HTTP request is sent to a specific endpoint of the app, with 200 OK expected as the response
port - a TCP connection can be made on a designated port or ports. This is the default.
process - the process is running. For example, the Python interpreter is running

It is recommended to use http, as this is the only option that can ensure the app is ready for traffic. In the case of web apps, both port and process health checks will only confirm that the web server is online, not that the underlying software is able to respond to traffic. In addition, should an app become unresponsive, a port health check may still pass even though the underlying app is no longer able to respond to a request.
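As an illustration of an http health check that only passes once the app is ready, the following is a minimal sketch using the HTTP server bundled with the JDK. The class name `ReadinessServer`, the `/health` path and the `markReady()` method are assumptions made for this example, not names defined by Cloud Foundry; the path has to match whatever endpoint the health check is configured to call.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical example class: serves a health endpoint that reports
// 503 until the app declares itself ready, and 200 afterwards.
final class ReadinessServer {
    // Flipped to true only once the app has finished its own start-up work.
    private final AtomicBoolean ready = new AtomicBoolean(false);
    private final HttpServer server;

    ReadinessServer(int port) throws IOException {
        server = HttpServer.create(new InetSocketAddress(port), 0);
        // "/health" is an assumed path for this sketch.
        server.createContext("/health", exchange -> {
            boolean ok = ready.get();
            byte[] body = (ok ? "OK" : "NOT READY").getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(ok ? 200 : 503, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
    }

    // Cloud Foundry passes the port to listen on in the PORT environment variable;
    // 8080 is only a fallback for running the sketch locally.
    static ReadinessServer onPlatformPort() throws IOException {
        int port = Integer.parseInt(System.getenv().getOrDefault("PORT", "8080"));
        return new ReadinessServer(port);
    }

    void start() { server.start(); }

    // Call this after caches are warm, connection pools are open, and so on.
    void markReady() { ready.set(true); }

    int port() { return server.getAddress().getPort(); }

    void stop() { server.stop(0); }
}
```

An app would typically call `start()` early, do its initialisation, and call `markReady()` last, so the health check fails during start-up and the instance is not put into rotation too soon.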

What happens when an app crashes¶

In the case that an app becomes unresponsive the process is as follows:

If the Gorouter cannot make a TCP connection to an instance, it marks the instance as bad, takes it out of rotation for 30 seconds and transparently retries the request against other instances
The app health check fails
The Gorouter is instructed to remove the app from the routing table
The app is restarted immediately and on restart failure it follows a back-off routine.

If the Gorouter is still able to make a TCP connection to an app instance (for example, a web server is listening but is unable to respond to the request), the Gorouter will continue to send traffic to that instance. To mitigate this, it is recommended to set the http health check interval below the default of 30 seconds. Depending on the CPU cost of the health check, there could be an impact on the platform if the value is set too low.

What happens when an app stops¶

The stop process applies to the following events:

cf push - old instances of an app are stopped on a rolling basis and replaced by a new version
cf restart (rolling strategy) - old instances of an app are stopped on a rolling basis
cf stop - all instances of the app are stopped
Platform scale down - Diego cells are drained and removed
Platform upgrade - Diego cells follow a rolling replacement

At a very high level the app shutdown process is as follows:

The Gorouter removes the app from its routing table, meaning that no new requests will be sent, but responses to outstanding requests will be honoured
The scheduler instructs the Diego cell to stop the app
The container is sent the SIGTERM signal, which the app should treat as a soft shutdown event and gracefully complete outstanding requests before stopping cleanly
If after 10 seconds the container has not exited, Diego then sends a SIGKILL which will terminate all processes

Should apps need more time to shut down, the interval can be extended system wide, but this has the side effect that Diego maintenance events could take longer.

Each language has a different way to respond to SIGTERM.

Java shutdown¶

Java allows the developer to configure pre-shutdown hooks, to insert logic into the shutdown process.

The default behaviour in Java is as follows:

JVM receives SIGTERM
All pre-shutdown hooks are triggered (if any are defined)
The JVM will then wait for all non-daemon threads to complete before exiting

The last point is critical, as the JVM will not exit until all non-daemon threads complete, meaning the app should be designed to take this into account.
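The sequence above can be sketched with a shutdown hook that drains a worker pool. This is a minimal sketch, not a definitive implementation: the class name `GracefulShutdown`, the method names and the idea of a single `ExecutorService` holding all request work are assumptions for illustration. The timeout passed to `install` should be chosen to fit inside the 10-second window before SIGKILL.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: drains a worker pool when the JVM receives SIGTERM.
final class GracefulShutdown {
    // Stops accepting new tasks and waits up to timeoutSeconds for running ones.
    // Returns true if all tasks finished in time, false if they had to be interrupted.
    static boolean drain(ExecutorService workers, long timeoutSeconds) {
        workers.shutdown(); // reject new tasks, let in-flight ones finish
        try {
            if (workers.awaitTermination(timeoutSeconds, TimeUnit.SECONDS)) {
                return true;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
        }
        workers.shutdownNow(); // interrupt whatever is still running
        return false;
    }

    // Registers a shutdown hook; the JVM runs it on SIGTERM and waits for it to return.
    static void install(ExecutorService workers, long timeoutSeconds) {
        Runtime.getRuntime().addShutdownHook(
                new Thread(() -> drain(workers, timeoutSeconds), "graceful-shutdown"));
    }
}
```

For example, `GracefulShutdown.install(workers, 8)` would leave roughly two seconds of headroom before the platform sends SIGKILL; the value 8 is an illustrative choice, not a recommendation from the platform.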

Spring annotation¶

Spring apps can use the @PreDestroy annotation to ensure a method is called before exiting.

For Java 9+ the following dependency needs to be added.

<dependency>
    <groupId>javax.annotation</groupId>
    <artifactId>javax.annotation-api</artifactId>
    <version>1.3.2</version>
</dependency>

Detecting a SIGKILL¶

If the following line appears in the app logs, it shows that the app was forcibly shut down by the system because it did not exit in time after receiving a SIGTERM. Exit status 137 is 128 + 9, where 9 is the signal number of SIGKILL.

OUT Exit status 137 (exceeded 10s graceful shutdown interval)

Testing app behaviour¶

Should an app team need to test the behaviour to ensure the stop and start events are transparent to a client it is recommended to run cf restart --strategy rolling in a dev environment whilst the app is under load. If the app is coded, configured and scaled correctly, then the operation will be invisible to the client.
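To check whether a restart is visible to a client, a simple probe can poll the app while the rolling restart runs and count every response that is not 200 OK. The following is a minimal sketch that assumes Java 11+ for `java.net.http.HttpClient`; the class name `RestartProbe` and its parameters are hypothetical, and a single-threaded probe like this complements a load generator rather than replacing one.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical probe: polls one URL and counts the requests a client would see fail.
final class RestartProbe {
    // Sends `attempts` GET requests, pausing between them, and returns how many
    // either failed at the connection level or returned a status other than 200.
    static int countFailures(URI target, int attempts, Duration pause) throws InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(2))
                .build();
        HttpRequest request = HttpRequest.newBuilder(target)
                .timeout(Duration.ofSeconds(5))
                .GET()
                .build();
        int failures = 0;
        for (int i = 0; i < attempts; i++) {
            try {
                HttpResponse<Void> response =
                        client.send(request, HttpResponse.BodyHandlers.discarding());
                if (response.statusCode() != 200) {
                    failures++;
                }
            } catch (IOException e) {
                failures++; // connection refused, reset, or timed out
            }
            Thread.sleep(pause.toMillis());
        }
        return failures;
    }
}
```

Running `countFailures` against the app's route for the duration of the restart should return 0 if the operation is invisible to the client; any other value points at the health check, shutdown handling or instance count.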