The wonderful world of Link Time Optimization – part 1 (the theory)

The need for performant software

In the world of programming, performance is not really an issue for most types of applications, even though many software developers seem to think otherwise – premature optimization is the root of all evil, and so on.

However, when a piece of software is run by many people, even if it is not wasting too many CPU cycles, a case can be made – why waste any at all?

So, in an attempt to have fast and efficient programs, we turn to profiling tools, take measurements, analyse and compare – something that some of us actually enjoy a lot, myself included.

Usually, any performance optimization should be done after some actual profiling, so that you can see, for example, that X CPU cycles are being wasted in a sub-optimal manner after a server starts processing a request.

(Screenshot: profiling data visualised in kcachegrind)

Most optimization efforts should be concentrated on more efficient algorithms, since this is where pessimization usually originates.

However… sometimes we have already made some pretty good optimizations in the code, but still – wouldn’t it be super nice if we could squeeze out a few additional percent of performance just by using a few compiler flags? I will not speak about compiler flags in general in this post, since that topic is epic, but will concentrate on one relatively underused option: LTO, or Link Time Optimization.

What is LTO and why do we need it

So let’s examine a simple program that we are compiling. It consists of the following files:

  • main.cpp
  • somecode1.cpp
  • somecode1.h
  • somecode2.cpp
  • somecode2.h

Let’s say that the somecode* files contain a set of functions that call each other.
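To make the example concrete, here is a minimal sketch of what these files might contain (the function names and bodies are hypothetical, chosen only so that calls cross translation-unit boundaries):

    // somecode1.h
    int transform(int x);

    // somecode1.cpp
    #include "somecode1.h"
    int transform(int x) { return x * 2 + 1; }  // definition hidden from the other units

    // somecode2.h
    int process(int x);

    // somecode2.cpp
    #include "somecode1.h"
    #include "somecode2.h"
    int process(int x) { return transform(x) + transform(x + 1); }  // calls into somecode1

    // main.cpp
    #include <cstdio>
    #include "somecode2.h"
    int main() { std::printf("%d\n", process(21)); return 0; }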

A typical build will look like this:

main.cpp ==> preprocess ==> compilation ==> optimization ==> main.o

somecode1.cpp ==> preprocess ==> compilation ==> optimization ==> somecode1.o

somecode2.cpp ==> preprocess ==> compilation ==> optimization ==> somecode2.o

main.o + somecode1.o + somecode2.o ==> linking ==> executable
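Expressed as actual commands, such a build might look like this (g++ is used here for illustration; clang++ accepts the same flags):

    g++ -O2 -c main.cpp -o main.o
    g++ -O2 -c somecode1.cpp -o somecode1.o
    g++ -O2 -c somecode2.cpp -o somecode2.o
    g++ main.o somecode1.o somecode2.o -o program

Each compiler invocation sees exactly one .cpp file, so -O2 can only optimize within that file.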

So here we see that optimization is applied separately to each compilation unit, and the linker then simply glues the resulting object files together. Since in our hypothetical example there is interdependence between the somecode1 and somecode2 units, the compiler is not able to optimize across them – for example, it can’t inline the hypothetical transform() from somecode1.cpp into the call sites in somecode2.cpp, because when compiling somecode2.cpp it only sees the declaration in the header.

Many C/C++ library producers build their binaries from a single c/cpp file that includes all the necessary sources, so the optimizer can “see” all the code at once and perform the most efficient optimization – basically the “poor man’s LTO”.
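Such a “unity build” can be as simple as one extra file that textually includes all the real sources – a hypothetical all.cpp for our example could look like this:

    // all.cpp – the entire program as one big compilation unit
    #include "somecode1.cpp"
    #include "somecode2.cpp"
    #include "main.cpp"

    // built with a single command: g++ -O2 all.cpp -o program

Now the optimizer sees every function definition at once, and the cross-unit inlining from above becomes possible.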

However, this approach has a lot of disadvantages regarding build time and project maintenance (any change forces a full rebuild, and the merged sources can step on each other’s names). So most compilers have introduced dedicated support for this type of optimization.

So how does an LTO build look:

main.cpp ==> preprocess ==> pre-compilation ==> main.lto

somecode1.cpp ==> preprocess ==> pre-compilation ==> somecode1.lto

somecode2.cpp ==> preprocess ==> pre-compilation ==> somecode2.lto

main.lto + somecode1.lto + somecode2.lto ==> compilation ==> optimization ==> linking ==> executable
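With GCC and Clang, this pipeline is what the -flto flag enables. It has to be passed both when compiling and when linking, since the link step now does the heavy lifting:

    g++ -O2 -flto -c main.cpp -o main.o
    g++ -O2 -flto -c somecode1.cpp -o somecode1.o
    g++ -O2 -flto -c somecode2.cpp -o somecode2.o
    g++ -O2 -flto main.o somecode1.o somecode2.o -o program

The .o files produced this way play the role of the .lto files in the diagram above: they mainly carry the compiler’s intermediate representation rather than ready machine code.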

Here, in the initial phase, we do only preprocessing of the source files and a sort of pre-compilation, which translates the code into the compiler’s intermediate representation (GIMPLE bytecode in GCC, LLVM bitcode in Clang) instead of final machine code, deferring most optimization decisions.

After that, when we actually link the final executable, the compiler can do the final translation to machine code and perform all possible optimizations on the whole program at once – including inlining across the .cpp files, which was impossible before.

In part 2 of this article I will compare compiler support for link time optimization.