Why Compiler Function Inlining Matters

All about function inlining, how it helps us create performant software, how we can learn to work with it, and how it influences profiling.

December 15, 2021

Let's imagine a super simple program written in Go: it iterates from 0 to 1,000 and, on every iteration, adds to a running result using a small `add` function.

package main

func main() {
    var result uint64
    for i := uint64(0); i < 1_000; i++ {
        result += add(result, i)
    }
}

//go:noinline
func add(a, b uint64) uint64 {
    return a + b
}

Note: This works similarly in other compiled languages; we just use Go as an example.

The `//go:noinline` directive tells the compiler not to inline this function. How does it perform? Let's find out with a simple benchmark (placed in a `_test.go` file in the same package).

package main

import "testing"

func BenchmarkAdd(b *testing.B) {
    for i := 0; i < b.N; i++ {
        main()
    }
}

We run this benchmark with
`go test -bench=BenchmarkAdd -count=10 | tee BenchmarkAddNoInline.txt`

goos: linux
goarch: amd64
pkg: github.com/polarsignals/inlining
cpu: AMD Ryzen 9 3900X 12-Core Processor
BenchmarkAdd-24    	  835048	      1825 ns/op
BenchmarkAdd-24    	  859546	      1606 ns/op
BenchmarkAdd-24    	  856646	      1909 ns/op
BenchmarkAdd-24    	  855582	      1715 ns/op
BenchmarkAdd-24    	  856621	      1431 ns/op
BenchmarkAdd-24    	  845157	      1545 ns/op
BenchmarkAdd-24    	  765014	      1466 ns/op
BenchmarkAdd-24    	  812818	      1441 ns/op
BenchmarkAdd-24    	  787130	      1496 ns/op
BenchmarkAdd-24    	  867459	      1456 ns/op
PASS
ok  	github.com/polarsignals/inlining	18.092s

On its own, this doesn't tell us much, so let's compare it against a benchmark run where the `add` function is inlined. If we remove the `//go:noinline` directive, the compiler should inline the function. Let's run the benchmark again:
`go test -bench=BenchmarkAdd -count=10 | tee BenchmarkAddInline.txt`

goos: linux
goarch: amd64
pkg: github.com/polarsignals/inlining
cpu: AMD Ryzen 9 3900X 12-Core Processor
BenchmarkAdd-24    	 2404077	       464.0 ns/op
BenchmarkAdd-24    	 2358136	       485.2 ns/op
BenchmarkAdd-24    	 2420811	       464.9 ns/op
BenchmarkAdd-24    	 2429461	       477.0 ns/op
BenchmarkAdd-24    	 2376388	       459.8 ns/op
BenchmarkAdd-24    	 2396380	       483.8 ns/op
BenchmarkAdd-24    	 2533822	       476.6 ns/op
BenchmarkAdd-24    	 2457052	       460.3 ns/op
BenchmarkAdd-24    	 2430799	       488.9 ns/op
BenchmarkAdd-24    	 2432954	       467.3 ns/op
PASS
ok  	github.com/polarsignals/inlining	16.517s

Interesting!

The Go project has a little helper tool called benchstat that we can use to compare these results.
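Assuming benchstat is installed (it lives at golang.org/x/perf/cmd/benchstat), we can compare the two files we saved with tee:
`benchstat BenchmarkAddNoInline.txt BenchmarkAddInline.txt`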

name    old time/op  new time/op  delta
Add-24  1.59µs ±20%  0.47µs ± 3%  -70.25%  (p=0.000 n=10+10)

It seems that for this example program inlining the add function makes a huge difference. Why is that?

Why does inlining exist?

When you call a function in your program, the compiler must emit a few extra instructions to actually make that call happen. Specifically, depending on the calling convention (ABI), the compiler passes the function's arguments either on the stack or in CPU registers. Next, the return address, the point where execution should resume in the caller, is pushed onto the stack so we can continue where we left off once the called function finishes. Finally, a call (or similar jump) instruction transfers control to the called function. When it returns, we reverse the process: the caller's stack frame is restored and the return values are read off the stack or out of CPU registers. This overhead is relatively small, but when a function is called in a tight loop it really adds up. Inlining removes this overhead entirely by "inlining", that is, copying the instructions the function would normally execute directly into the function that calls it.

If this sounds like a lot of overhead for this small function, then you're right.

Inlining has some other nice properties as well: because the inlined instructions are laid out contiguously with the caller's code, instruction fetching and CPU instruction-cache behavior improve, since execution no longer has to jump to a different location in memory. It also opens the door to further optimizations, because the compiler can now analyze the inlined code together with its caller.

Inlining

Since the `add` function is small and simple enough to fit within the compiler's inlining budget, Go decides to inline it once we drop the `//go:noinline` directive. We can check this by compiling the program with the `-m` flag: `go build -gcflags -m main.go`

# command-line-arguments
./main.go:10:6: can inline add
./main.go:3:6: can inline main
./main.go:6:16: inlining call to add
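If you want even more detail about the compiler's decisions, you can pass `-m` twice; the output then also includes each function's inlining cost and the reason when something cannot be inlined (the exact wording varies between Go versions):
`go build -gcflags="-m -m" main.go`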

We can compare the assembly with and without inlining by adding or removing the `//go:noinline` directive.

On the compiled binary, we can run `go tool objdump main | grep main.go`:

TEXT main.main(SB) /home/metalmatze/src/github.com/polarsignals/inlining/main.go
  main.go:3    0x4553e0    493b6610      CMPQ 0x10(R14), SP
  main.go:3    0x4553e4    7651          JBE 0x455437
  main.go:3    0x4553e6    4883ec28      SUBQ $0x28, SP
  main.go:3    0x4553ea    48896c2420    MOVQ BP, 0x20(SP)
  main.go:3    0x4553ef    488d6c2420    LEAQ 0x20(SP), BP
  main.go:3    0x4553f4    31c0          XORL AX, AX
  main.go:3    0x4553f6    31c9          XORL CX, CX
  main.go:5    0x4553f8    eb2b          JMP 0x455425
  main.go:5    0x4553fa    4889442418    MOVQ AX, 0x18(SP)
  main.go:6    0x4553ff    48894c2410    MOVQ CX, 0x10(SP)
  main.go:6    0x455404    4889c3        MOVQ AX, BX
  main.go:6    0x455407    4889c8        MOVQ CX, AX
  main.go:6    0x45540a    e831000000    CALL main.add(SB)
  main.go:5    0x45540f    488b4c2418    MOVQ 0x18(SP), CX
  main.go:5    0x455414    48ffc1        INCQ CX
  main.go:6    0x455417    488b542410    MOVQ 0x10(SP), DX
  main.go:6    0x45541c    4801c2        ADDQ AX, DX
  main.go:5    0x45541f    4889c8        MOVQ CX, AX
  main.go:6    0x455422    4889d1        MOVQ DX, CX
  main.go:5    0x455425    483de8030000  CMPQ $0x3e8, AX
  main.go:5    0x45542b    72cd          JB 0x4553fa
  main.go:8    0x45542d    488b6c2420    MOVQ 0x20(SP), BP
  main.go:8    0x455432    4883c428      ADDQ $0x28, SP
  main.go:8    0x455436    c3            RET
  main.go:3    0x455437    e824ceffff    CALL runtime.morestack_noctxt.abi0(SB)
  main.go:3    0x45543c    eba2          JMP main.main(SB)

TEXT main.add(SB) /home/metalmatze/src/github.com/polarsignals/inlining/main.go
  main.go:12    0x455440    4801d8       ADDQ BX, AX
  main.go:12    0x455443    c3           RET

As you can see, at the end there is our `add` function, compiled down to just two instructions. We can also see the `CALL main.add(SB)` instruction in `main.main` that invokes it. Now, if we let the Go compiler inline the `add` function, we get the resulting assembly:

TEXT main.main(SB) /home/metalmatze/src/github.com/polarsignals/inlining/main.go
  main.go:3    0x4553e0    31c0          XORL AX, AX
  main.go:5    0x4553e2    eb03          JMP 0x4553e7
  main.go:5    0x4553e4    48ffc0        INCQ AX
  main.go:5    0x4553e7    483de8030000  CMPQ $0x3e8, AX
  main.go:5    0x4553ed    72f5          JB 0x4553e4
  main.go:8    0x4553ef    c3            RET

As you can see, there is no `CALL main.add(SB)` anymore; everything happens inside `main.main`, which means the overhead of calling `add` is gone. In fact, once `add` was inlined, the compiler could see that `result` is never used after the loop and eliminated the additions entirely; all that is left is the loop counter. That is a big part of why the inlined benchmark is so much faster.
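As an aside: if you want a benchmark like this to keep measuring the additions even after inlining, a common pattern is to store the result in a package-level variable so the compiler cannot prove the computation is dead. This is only a sketch; the `sink` variable and the `BenchmarkAddInlined` name are ours for illustration, not part of the original example.

var sink uint64 // package-level, so the compiler can't discard writes to it

func BenchmarkAddInlined(b *testing.B) {
    var result uint64
    for i := 0; i < b.N; i++ {
        result += add(result, uint64(i))
    }
    sink = result // keep the computation observable
}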

Function inlining and profiling

Inlined functions essentially disappear as separate function calls in the compiled binary. As humans reading a profile, however, we still think in terms of the source code, so it's important to be able to identify inlined functions when analyzing profiling data.

In pprof, each Location (an instruction address) has one or more Lines, and each Line references a Function. Inlined functions therefore share a Location with the function they were inlined into, but they get their own Line (think about it: these functions still live on a different source code line), which in turn points to their own Function. More on the pprof internals can be found in our previous “DIY pprof profiles using Go” blog post!
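As a rough sketch of what that looks like in code, here is how a single Location with an inlined frame could be built using the github.com/google/pprof/profile package. The addresses, IDs, and line numbers below are made up for illustration; in the pprof format, the last Line entry is the caller into which the preceding entries were inlined.

package main

import (
    "fmt"

    "github.com/google/pprof/profile"
)

func main() {
    // The caller and the function that was inlined into it.
    mainFn := &profile.Function{ID: 1, Name: "main.main", Filename: "main.go", StartLine: 3}
    addFn := &profile.Function{ID: 2, Name: "main.add", Filename: "main.go", StartLine: 10}

    // A single Location (one sampled instruction address) with two Lines:
    // the innermost inlined frame first, the caller last.
    loc := &profile.Location{
        ID:      1,
        Address: 0x4553e7,
        Line: []profile.Line{
            {Function: addFn, Line: 11},  // inlined main.add
            {Function: mainFn, Line: 6},  // the call site inside main.main
        },
    }

    // Walking the Lines from last to first yields the frames caller-first,
    // with the inlined function appearing as its own frame.
    for i := len(loc.Line) - 1; i >= 0; i-- {
        line := loc.Line[i]
        fmt.Printf("%#x %s %s:%d\n", loc.Address, line.Function.Name, line.Function.Filename, line.Line)
    }
}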

Rendering these inlined functions is done by showing them as part of the stack trace and essentially “squeezing” them in between the other functions.

Here you can see a part of a Prometheus goroutine stack trace. The `waitRead` function was inlined and is shown like any other function.

Rendering a flame graph with inlined functions

Each profile within Parca, a continuous profiling project for applications and infrastructure, is rendered as an icicle graph. To build one, we walk all stack traces of a profile and merge them into a single tree data structure: the traces share a common root, and each individual stack trace is inserted as a branch into the existing tree.

Inlined functions make it quite a challenge to render Parca’s icicle graphs properly: while merging a new stack trace into the tree, each inlined function becomes its own frame (or even its own subtree) again, and these have to be merged correctly into the existing tree too.
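To illustrate the general idea (this is a simplified sketch, not Parca’s actual implementation), here is a tiny tree-merging example where an inlined function such as `main.add` shows up as its own frame in a trace and gets merged like any other frame:

package main

import "fmt"

// frameNode is one node of the merged icicle-graph tree; children are
// keyed by function name for simplicity.
type frameNode struct {
    name     string
    value    int64
    children map[string]*frameNode
}

func newNode(name string) *frameNode {
    return &frameNode{name: name, children: map[string]*frameNode{}}
}

// insert merges a single stack trace (a root-first list of frame names)
// into the tree, accumulating the sampled value along the path.
func (n *frameNode) insert(frames []string, value int64) {
    n.value += value
    if len(frames) == 0 {
        return
    }
    child, ok := n.children[frames[0]]
    if !ok {
        child = newNode(frames[0])
        n.children[frames[0]] = child
    }
    child.insert(frames[1:], value)
}

// print dumps the tree, indented by depth.
func (n *frameNode) print(depth int) {
    fmt.Printf("%*s%s (%d)\n", depth*2, "", n.name, n.value)
    for _, c := range n.children {
        c.print(depth + 1)
    }
}

func main() {
    root := newNode("root")

    // Two sampled stack traces, root-first. In the second one, main.add was
    // inlined into main.main, but it still appears as its own frame,
    // "squeezed" in between the others, and merges into the same tree.
    root.insert([]string{"runtime.main", "main.main"}, 3)
    root.insert([]string{"runtime.main", "main.main", "main.add"}, 5)

    root.print(0)
}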

Finally, our implementation handles these cases correctly since we merged our pull request: https://github.com/parca-dev/parca/pull/485

Roadmap for inlined functions

Currently, we don’t display inlined functions in any special way. What do you think, reader: would you want us to highlight them in the icicle graphs, or is it fine to simply show them as "normal" functions?
