Understanding the Docker cache

One of the main points of confusion when building images is understanding how the Docker layers work.

Each of the commands in a Dockerfile is executed consecutively, on top of the previous layer. If you are comfortable with Git, you'll notice that the process is similar. Each layer only stores the changes relative to the previous step.

This allows Docker to cache quite aggressively, as any layer before a change has already been calculated. In our example, we update the available packages with apk update, then install the python3 package, before copying the example.txt file. Any change to the example.txt file will only re-execute the last two steps, on top of layer be086a75fe23. This speeds up the rebuilding of images.
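As a rough sketch of what such a Dockerfile could look like, consider the following. The alpine base image is an assumption inferred from the use of apk, and the /opt/ path and the final cat step are hypothetical, added only to illustrate the "last two steps" that run again when example.txt changes.

```dockerfile
# Assumed base image: apk is the Alpine Linux package manager
FROM alpine

# Refresh the package index. Reused from the cache unless this line
# or the base image changes
RUN apk update

# Install Python 3. Reused from the cache while the steps above are unchanged
RUN apk add python3

# Hypothetical target directory for the file
RUN mkdir -p /opt/

# A change in example.txt invalidates the cache from this step onward
COPY example.txt /opt/example.txt

# Hypothetical final step: it runs again whenever the COPY step above runs again
RUN cat /opt/example.txt
```

Rebuilding after editing only example.txt should reuse the cached layers up to the mkdir step and execute only the COPY and the final RUN.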

It also means that you need to construct your Dockerfiles carefully to not invalidate the cache. Start with the operations that change very rarely, such as installing the project dependencies, and finish with the ones that change more often, such as adding your code. The annotated Dockerfile for our example has indications about the usage of the cache.
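A minimal sketch of this ordering for a Python project could look like the following. The python:3-alpine base image, the requirements.txt file, the src/ directory, and the main.py entry point are all illustrative assumptions, not part of the example above.

```dockerfile
# Assumed base image with Python preinstalled
FROM python:3-alpine

WORKDIR /opt/app

# Dependencies change rarely: copy only the requirements file first,
# so this layer and the install below stay cached when the code changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# The code changes often: copy it last, so edits only invalidate
# the cache from this step onward
COPY src/ ./src/

# Hypothetical entry point
CMD ["python3", "src/main.py"]
```

If the code were copied before running pip install, every code change would force the dependencies to be reinstalled.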

This also means that adding a new layer will never make an image smaller, even if the layer removes data, as the previous layer is still stored on disk. If you want to remove cruft from a step, you'll need to do so in the same step.
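The following sketch illustrates the idea of cleaning up in the same step, assuming an Alpine-based image. The packages and the placeholder echo command are hypothetical and stand in for whatever build work actually needs the extra tools.

```dockerfile
FROM alpine

# Install the build tools, use them, and remove them in a single RUN,
# so the resulting layer never contains them.
# --virtual groups the packages under one name (.build-deps) so they
# can be deleted together; --no-cache avoids storing the package index
RUN apk add --no-cache --virtual .build-deps gcc musl-dev \
    && echo "placeholder for the actual build commands" \
    && apk del .build-deps

# By contrast, splitting this into separate steps would not save space:
# the layer created by "RUN apk add ..." would still hold the packages,
# even if a later "RUN apk del ..." layer hides them
```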

Keeping your containers small is quite important. In any Docker system, the tendency is to have a bunch of containers and lots of images. Needlessly big images will fill up repositories quickly. They'll be slow to download and push, and also slow to start, as the image is copied around in your infrastructure.

There's another practical consideration. Containers are a great tool to simplify your service and reduce it to the minimum. With a bit of investment, you'll get great results and keep your containers small and to the point.

There are several practices for keeping your images small. Other than being careful to not install extra elements, the main ones are creating a single, complicated layer that installs and uninstalls, and multi-stage images. Multi-stage Dockerfiles are a way of referring to a previous intermediate layer and copying data from there. Check the Docker documentation (https://docs.docker.com/develop/develop-images/multistage-build/).

Compilers, in particular, tend to take up a lot of space. When possible, try to use precompiled binaries. You can use a multi-stage Dockerfile to compile in one container and then copy the binaries to the running one.
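As a minimal sketch of that approach, assuming a hypothetical hello.c source file and Alpine-based stages, a multi-stage Dockerfile could compile in a first stage and copy only the resulting binary into the final image:

```dockerfile
# First stage: contains the compiler, and is discarded from the final image
FROM alpine AS builder
RUN apk add --no-cache gcc musl-dev
# hello.c is a hypothetical source file used only for illustration
COPY hello.c /src/hello.c
RUN gcc -static -o /src/hello /src/hello.c

# Final stage: starts from a clean base image, without gcc
FROM alpine
# Copy only the compiled binary from the builder stage
COPY --from=builder /src/hello /usr/local/bin/hello
CMD ["hello"]
```

Only the layers of the final stage end up in the resulting image, so the compiler and its dependencies add nothing to its size.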

You can learn more about the differences between the two strategies in this article: https://pythonspeed.com/articles/smaller-python-docker-images/.

A good tool to analyze a particular image and the layers that compose it is dive (https://github.com/wagoodman/dive). It can also point out ways in which an image can be reduced in size.

We'll create a multi-stage container in the next step.