We'll start with the fundamentals: What are dynamic dispatch and devirtualization in Go in regards to interfaces?
In Go, when a function accepts a parameter whose type is an interface and a method ends up being called on that parameter, the runtime first needs to figure out which concrete method to execute, since the concrete type isn't known at the call site (not knowing it is the whole point of interfaces in Go). This lookup is referred to as dynamic dispatch.
Let's have a look at a small example:
package main

type TestInterface interface {
	Something()
}

type ConcreteType struct{}

func (t ConcreteType) Something() {}

func main() {
	t := ConcreteType{}
	AcceptsInterface(t)
}

func AcceptsInterface(i TestInterface) {
	for j := 0; j < 1_000_000; j++ {
		i.Something()
	}
}
This piece of code has a main function that instantiates ConcreteType, which implements TestInterface by defining the Something() method on it (the method itself is a no-op). It passes the instance to the function AcceptsInterface, whose parameter i is of type TestInterface. AcceptsInterface calls Something() one million times on the passed parameter i. Because AcceptsInterface doesn't know the concrete type, it has to figure out which concrete implementation of Something() to call on every one of those one million executions.
What's the impact of dynamic dispatch? Let's benchmark it!
Here's a simple benchmark.
func BenchmarkInterfaceCall(b *testing.B) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		main()
	}
}
Let's run it.
$ go test -bench=. -count=5 -cpuprofile=dynamic-dispatch-cpu.prof -memprofile=dynamic-dispatch-mem.prof | tee dynamic-dispatch.txt
goos: darwin
goarch: arm64
pkg: github.com/polarsignals/go-interface-devirtualization-pgo
BenchmarkInterfaceCall-10 1254 954272 ns/op 13 B/op 0 allocs/op
BenchmarkInterfaceCall-10 1258 959966 ns/op 0 B/op 0 allocs/op
BenchmarkInterfaceCall-10 1240 958960 ns/op 0 B/op 0 allocs/op
BenchmarkInterfaceCall-10 1243 959028 ns/op 0 B/op 0 allocs/op
BenchmarkInterfaceCall-10 1260 956171 ns/op 0 B/op 0 allocs/op
PASS
ok github.com/polarsignals/go-interface-devirtualization-pgo 6.855s
Ok, interesting result: the first run shows 13 B/op? Let's have a look at the memory profile.
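One way to inspect it, assuming the profile file written by the benchmark run above, is with go tool pprof:

```shell
# Show the top allocation sites recorded in the memory profile.
go tool pprof -top dynamic-dispatch-mem.prof

# Or explore it interactively, e.g. to list a specific function:
go tool pprof dynamic-dispatch-mem.prof
# (pprof) top
# (pprof) list BenchmarkInterfaceCall
```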
Alright, it looks like those bytes came from the go test framework and the profiling itself; the remaining runs don't allocate at all, so our code doesn't do any heap allocations. Phew!
Now, the most dramatic way to demonstrate the cost of dynamic dispatch is to type-assert the parameter to its concrete type.
--- i.Something()
+++ i.(ConcreteType).Something()
And rerun the benchmark.
$ go test -run=^$ -bench=BenchmarkInterfaceCall -count=5 -cpuprofile=type-assert-cpu.prof -memprofile=type-assert-mem.prof | tee type-assert.txt
goos: darwin
goarch: arm64
pkg: github.com/polarsignals/go-interface-devirtualization-pgo
BenchmarkInterfaceCall-10 3318 319502 ns/op 4 B/op 0 allocs/op
BenchmarkInterfaceCall-10 3813 319497 ns/op 0 B/op 0 allocs/op
BenchmarkInterfaceCall-10 3820 320011 ns/op 0 B/op 0 allocs/op
BenchmarkInterfaceCall-10 3750 320674 ns/op 0 B/op 0 allocs/op
BenchmarkInterfaceCall-10 3780 320104 ns/op 0 B/op 0 allocs/op
PASS
ok github.com/polarsignals/go-interface-devirtualization-pgo 6.458s
And compare.
$ benchstat dynamic-dispatch.txt type-assert.txt
name old time/op new time/op delta
InterfaceCall-10 958µs ± 0% 320µs ± 0% -66.59% (p=0.008 n=5+5)
name old alloc/op new alloc/op delta
InterfaceCall-10 0.00B 0.00B ~ (all equal)
name old allocs/op new allocs/op delta
InterfaceCall-10 0.00 0.00 ~ (all equal)
Wow! A ~66% improvement. Of course this is a synthetic example to demonstrate the cost of dynamic dispatch, but I think we've shown there is overhead.
When the compiler applies this optimization by itself, then that's called devirtualization.
Note: Don't do this at home. A plain type assertion panics if the value doesn't have the asserted type. If you do apply this optimization by hand, only ever use a type switch (or the comma-ok assertion form) with a fallback to the dynamic call.
Enter Profile-Guided Optimizations (PGO)
Now, wouldn't it be nice if we didn't have to apply this optimization ourselves? It turns out that with Go 1.21, profile-guided optimization (PGO) is now generally available. PGO can be summarized as feeding profiling data to the compiler so it can perform optimizations it wouldn't otherwise know to be worthwhile, but that the profiling data shows will pay off in practice.
Let's give it a spin. All we need to do is either have a CPU profile named default.pgo in the main package's directory, or pass a profile via the -pgo flag. We'll undo the type assertion and use the profiling data we took from our first run.
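For reference, the same -pgo flag and default.pgo convention apply to go build; the commands below are a sketch assuming you've collected a representative profile like the one from our first benchmark run:

```shell
# Place a CPU profile named default.pgo next to the main package;
# since Go 1.21, -pgo defaults to "auto", which picks it up automatically.
cp dynamic-dispatch-cpu.prof default.pgo
go build .

# Or point the compiler at a specific profile, or disable PGO entirely.
go build -pgo=dynamic-dispatch-cpu.prof .
go build -pgo=off .
```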
$ go test -run=^$ -bench=BenchmarkInterfaceCall -count=5 -cpuprofile=pgo-devirtualization-cpu.prof -memprofile=pgo-devirtualization-mem.prof -pgo=dynamic-dispatch-cpu.prof | tee pgo-devirtualization.txt
goos: darwin
goarch: arm64
pkg: github.com/polarsignals/go-interface-devirtualization-pgo
BenchmarkInterfaceCall-10 2226 478978 ns/op 7 B/op 0 allocs/op
BenchmarkInterfaceCall-10 2520 478064 ns/op 0 B/op 0 allocs/op
BenchmarkInterfaceCall-10 2482 477475 ns/op 0 B/op 0 allocs/op
BenchmarkInterfaceCall-10 2505 478984 ns/op 0 B/op 0 allocs/op
BenchmarkInterfaceCall-10 2488 478541 ns/op 0 B/op 0 allocs/op
PASS
ok github.com/polarsignals/go-interface-devirtualization-pgo 6.528s
And compare to the initial run.
$ benchstat dynamic-dispatch.txt pgo-devirtualization.txt
name old time/op new time/op delta
InterfaceCall-10 957µs ± 0% 478µs ± 0% -50.00% (p=0.008 n=5+5)
name old alloc/op new alloc/op delta
InterfaceCall-10 0.00B 0.00B ~ (all equal)
name old allocs/op new allocs/op delta
InterfaceCall-10 0.00 0.00 ~ (all equal)
Wow, nice: we didn't have to modify our code and still got a 50% improvement! Why "only" 50%? Unlike the hand-written type assertion, which would have panicked had the concrete type not been the one we asserted, the devirtualization optimization has to ensure that our code still functions correctly for any other concrete type, so it keeps a dynamic-dispatch fallback.
The way this works is that, thanks to the provided profiling data, the Go compiler knows which concrete implementation is actually being called in practice, and therefore automatically inserts the equivalent of a type switch to devirtualize the call.
What's next?
We've learned that dynamic dispatch can have a significant cost in Go, but remember: never optimize prematurely; always measure whether the optimization is worth it. With PGO we can automate this and don't have to think about it or hunt for the cases where it pays off. PGO is still very new in the Go compiler toolchain, and while it's already impressive, it's evolving quickly, so I was happy to see that while I was writing this blog post, a new optimization was implemented that combines function inlining with devirtualization.
Lastly, there has always been a bit of a UX issue with PGO, and that is: How can you get representative profiling data from production? The answer is: Use a continuous profiler! And as it so happens, the Parca open-source project and Polar Signals Cloud are currently the only documented solutions that produce profiling data suitable for Go's PGO.
You can start a free 14-day trial today and try for yourself with our zero-instrumentation eBPF-based profiler, deployment only takes seconds!