The Linux kernel could soon be 50 to 80% faster to build

The Linux kernel takes around 5 minutes (without modules) to build on an Intel Core i5 Tiger Lake mini PC with 16 GB RAM and a fast SSD, based on our recent review of the Beelink GTi 11 mini PC. Kernel developers may have to build for different targets and configurations, plus all modules, so build times add up. While it is always possible to throw more hardware at the problem to quicken the builds, it would be good if significantly faster builds could be achieved with software optimizations.

That’s exactly what Ingo Molnar has been working on since late 2020 with his “Fast Kernel Headers” project aiming to eliminate the Linux kernel’s “Dependency Hell”. At the time he aimed for a 20% speedup, but a little over one year later, the results are much more impressive with 50 to 80% faster builds depending on the target platform (x86-64, arm64, etc…) and config.

This has been quite a Herculean effort, with the project consisting of 25 sub-trees internally and over 2,200 commits, which can be obtained with:

git clone git://git.kernel.org/pub/scm/linux/kernel/git/mingo/tip.git

Ingo further explains why reducing header dependencies is so hard and why so many commits are required:

As most kernel developers know, there’s around ~10,000 main .h headers in the Linux kernel, in the include/ and arch/*/include/ hierarchies. Over the last 30+ years, they have grown into a complicated & painful set of cross-dependencies we are affectionately calling ‘Dependency Hell’.
…

When I started this project, late 2020, I expected there to be maybe 50-100 patches. I did a few crude measurements that suggested that about 20% build speed improvement could be gained by reducing header dependencies, without having a substantial runtime effect on the kernel. Seemed substantial enough to justify 50-100 commits.

But as the number of patches increased, I saw only limited performance increases. By mid-2021 I got to over 500 commits in this tree and had to throw away my second attempt (!), the first two approaches simply didn’t scale, weren’t maintainable and barely offered a 4% build speedup, not worth the churn of 500 patches and not worth even announcing.

With the third attempt I introduced the per_task() machinery which brought the necessary flexibility to reduce dependencies drastically, and it was a type-clean approach that improved maintainability. But even at 1,000 commits I barely got to a 10% build speed improvement. Again this was not something I felt comfortable pushing upstream, or even announcing. :-/

But the numbers were pretty clear: 20% performance gains were very much possible. So I kept developing this tree, and most of the speedups started arriving after over 1,500 commits, in the fall of 2021. I was very surprised when it went beyond 20% speedup and more, then arrived at the current 78% with my reference config. There’s a clear super-linear improvement property of kernel build overhead, once the number of dependencies is reduced to the bare minimum.

You’d think kernel maintainers would be wary of accepting such a large number of patches, but the feedback from Greg KH is rather positive, even though he warned of potential maintenance issues:

This is “interesting”, but how are you going to keep the kernel/sched/per_task_area_struct_defs.h and struct task_struct_per_task definition in sync? It seems that you manually created this (which is great for testing), but over the long-term, trying to manually determine what needs to be done here to keep everything lined up properly is going to be a major pain.

That issue aside, I took a glance at the tree, and overall it looks like a lot of nice cleanups. Most of these can probably go through the various subsystem trees, after you split them out, for the “major” .h cleanups. Is that something you are going to be planning on doing?

The discussion with maintainers about how to proceed is still in progress. So let’s look at the numbers:

  #
  # Performance counter stats for 'make -j96 vmlinux' (3 runs):
  #
  # (Elapsed time in seconds):
  #

  v5.16-rc7:            231.34 +- 0.60 secs, 15.5 builds/hour    # [ vanilla baseline ]
  -fast-headers-v1:     129.97 +- 0.51 secs, 27.7 builds/hour    # +78.0% improvement

Or in terms of CPU time utilized:

  v5.16-rc7:            11,474,982.05 msec cpu-clock   # 49.601 CPUs utilized
  -fast-headers-v1:      7,100,730.37 msec cpu-clock   # 54.635 CPUs utilized   # +61.6% improvement
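
The two percentages in the output above are throughput gains (more builds per unit of time), not reductions in build time. The short Python sketch below recomputes them from the figures in the perf output; the function names are ours and purely illustrative.

```python
def throughput_gain_pct(baseline: float, improved: float) -> float:
    """Percent increase in work done per unit of time (e.g. builds/hour)."""
    return (baseline / improved - 1.0) * 100.0


def time_saved_pct(baseline: float, improved: float) -> float:
    """Percent reduction in the time (or CPU time) needed for one build."""
    return (1.0 - improved / baseline) * 100.0


# Figures copied from the perf output above.
ELAPSED_SECS = {"v5.16-rc7": 231.34, "-fast-headers-v1": 129.97}
CPU_MSEC = {"v5.16-rc7": 11_474_982.05, "-fast-headers-v1": 7_100_730.37}

if __name__ == "__main__":
    for label, data in (("elapsed", ELAPSED_SECS), ("cpu-clock", CPU_MSEC)):
        base, fast = data["v5.16-rc7"], data["-fast-headers-v1"]
        print(f"{label}: +{throughput_gain_pct(base, fast):.1f}% throughput, "
              f"{time_saved_pct(base, fast):.1f}% less time")
```

With these inputs, the elapsed-time gain works out to +78.0% and the CPU-time gain to +61.6%, matching the output, while the wall-clock time per build drops by about 43.8%.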

A full build of Linux 5.16-rc7 went from 231 seconds to just 130 seconds with the fast-headers optimization, which is about 44% less time per build, or 78% more builds per hour. You may think it’s only useful for people always building from scratch, but incremental builds benefit even more from the headers cleanup:

                                | v5.16-rc7                      | -fast-headers-v1               |
--------------------------------|--------------------------------|--------------------------------|-------
 'touch include/linux/sched.h'  | 230.30 secs | 15.6 builds/hour | 108.35 secs | 33.2 builds/hour | +112%
 'touch include/linux/mm.h'     | 216.57 secs | 16.6 builds/hour |  79.42 secs | 45.3 builds/hour | +173%
 'touch include/linux/fs.h'     | 223.58 secs | 16.1 builds/hour |  85.52 secs | 42.1 builds/hour | +161%
 'touch include/linux/device.h' | 224.35 secs | 16.0 builds/hour |  97.09 secs | 37.1 builds/hour | +132%
 'touch include/net/sock.h'     | 105.85 secs | 34.0 builds/hour |  40.88 secs | 88.1 builds/hour | +159%
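
The incremental figures above come from touching a widely included header and rebuilding. The Python sketch below shows one way to reproduce that kind of measurement. It is our assumption of the methodology, not Ingo's actual script: the `time_rebuild` helper and the `linux` tree path are hypothetical, and the `make -j96 vmlinux` command is borrowed from the full-build perf output.

```python
import subprocess
import time
from pathlib import Path


def time_rebuild(header: Path, build_cmd: list[str], cwd: Path) -> float:
    """Touch `header`, run `build_cmd` in `cwd`, and return elapsed seconds."""
    header.touch()  # bump mtime so make treats every dependent object as stale
    start = time.perf_counter()
    subprocess.run(build_cmd, cwd=cwd, check=True, stdout=subprocess.DEVNULL)
    return time.perf_counter() - start


if __name__ == "__main__":
    tree = Path("linux")  # assumed path to an already-built kernel tree
    for name in ("include/linux/sched.h", "include/linux/mm.h"):
        secs = time_rebuild(tree / name, ["make", "-j96", "vmlinux"], tree)
        print(f"touch {name}: {secs:.2f} secs, {3600 / secs:.1f} builds/hour")
```

The tree must be fully built once beforehand, otherwise the first measurement includes a full build. A single run is also noisy, so averaging several runs would give steadier numbers.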

Incremental builds are up to 173% faster in terms of builds per hour. The main reason for the improvement is the “drastic” reduction of “the effective post-preprocessing size of key kernel headers”, some of which are listed below:

    ------------------------------------------------------------------------------------------
    | Combined, preprocessed C code size of header, without line markers,
    | with comments stripped:
    ------------------------------.-----------------------------.-----------------------------
                                  | v5.16-rc7                   |  -fast-headers-v1
                                  |-----------------------------|-----------------------------
     #include <linux/sched.h>     | LOC: 13,292 | headers:  324 |  LOC:    769 | headers:   64
     #include <linux/wait.h>      | LOC:  9,369 | headers:  235 |  LOC:    483 | headers:   46
     #include <linux/rcupdate.h>  | LOC:  8,975 | headers:  224 |  LOC:  1,385 | headers:   86
     #include <linux/hrtimer.h>   | LOC: 10,861 | headers:  265 |  LOC:    229 | headers:   37
     #include <linux/fs.h>        | LOC: 22,497 | headers:  427 |  LOC:  1,993 | headers:  120
     #include <linux/cred.h>      | LOC: 17,257 | headers:  368 |  LOC:  4,830 | headers:  129
     #include <linux/dcache.h>    | LOC: 10,545 | headers:  253 |  LOC:    858 | headers:   65
     #include <linux/cgroup.h>    | LOC: 33,518 | headers:  522 |  LOC:  2,477 | headers:  111
     #include <linux/module.h>    | LOC: 16,948 | headers:  339 |  LOC:  2,239 | headers:  122
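
The source does not say how the figures in the table above were collected. As a rough sketch of how similar counts could be obtained, the Python function below parses preprocessor output such as that of `gcc -E`, which by default keeps line markers of the form `# 12 "file.h" 2`. It counts the non-blank code lines and the distinct files named in those markers. The sample input is made up for illustration, and preprocessing a real kernel header would also need the include paths and defines that the kernel build system passes, which are not shown here.

```python
import re

# GCC/Clang preprocessor line marker, e.g.:  # 12 "include/linux/wait.h" 2
LINEMARKER = re.compile(r'^#\s+\d+\s+"([^"]+)"')


def preprocessed_stats(text: str) -> tuple[int, int]:
    """Return (non-blank code lines, distinct files named in line markers)."""
    loc = 0
    files: set[str] = set()
    for line in text.splitlines():
        marker = LINEMARKER.match(line)
        if marker:
            name = marker.group(1)
            if not name.startswith("<"):  # skip <built-in>, <command-line>
                files.add(name)
        elif line.strip():
            loc += 1
    return loc, len(files)


# Hypothetical preprocessor output for a one-line probe.c including wait.h.
SAMPLE = '''# 1 "probe.c"
# 1 "<built-in>"
# 1 "include/linux/wait.h" 1
# 1 "include/linux/list.h" 1
struct list_head { struct list_head *next, *prev; };
# 5 "include/linux/wait.h" 2

struct wait_queue_head { struct list_head head; };
# 2 "probe.c" 2
'''

print(preprocessed_stats(SAMPLE))  # (2, 3): two code lines, three files
```

The file count here includes the top-level source file itself, so one would subtract it to get the number of headers pulled in.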

LOC stands for lines of code, and you can see that it is slashed in the -fast-headers-v1 tree. The same is true for the “headers” column, which represents the number of headers included indirectly. Supported platforms include x86 32-bit & 64-bit (boot tested and main machine), ARM64 (boot tested), as well as MIPS 32-bit & 64-bit and Sparc 32-bit & 64-bit, but those have only been built, and not tested on actual hardware.

The results are impressive, and if the Fast Kernel Headers commits get merged, they could extend the life of existing build farms and slightly quicken the Linux development process.

Via ZDNet

Jean-Luc Aufranc (CNXSoft)