Comparing Profiles is even more helpful now!

August 6, 2024

The power of continuous profiling comes from continuously profiling in production. It enables engineers to travel back in time and understand the impact of a query, a rollout, or the system's overall performance in the real world with actual load.

Comparing these profiles enables engineers to understand what happened before and after an event. That means, for example, comparing two profiles taken at specific points in time, or merging profiles across time ranges, such as the hour before a rollout and the hour after it, and then comparing the system's performance. You can identify performance regressions introduced by recent code changes, pinpoint areas consuming excessive resources, and measure the impact of optimizations. This comparative analysis empowers engineers to make data-driven decisions and proactively address performance bottlenecks.

What we had so far

Absolute Comparisons

Until now, Polar Signals Cloud and Parca have been comparing all values as they are. We call this "absolute comparisons".

Absolute comparisons work well when the two profiles are directly comparable, meaning they share the same duration, frequency, and time range. An example would be snapshot profiles, like memory allocation, memory heap, and goroutines. Absolute comparisons also work for CPU profiles, but as these are not snapshots but delta profiles (profiled over time, like 10 seconds), there are more subtleties at play.

Because CPU profiles are delta profiles, we can merge individual profiles into bigger profiles that span minutes, hours, or days. Imagine merging all CPU profiles for 5 minutes and comparing that to a merge of all profiles for 1 hour, or merging all profiles of an older version of your application and comparing that to a merge of all profiles of a newer version. Assuming the program does the same thing throughout that period, we are comparing 60 minutes of data to 5 minutes of data. The one-hour profile ends up with values 12 times bigger than the five-minute profile, as `60min / 5min = 12`.
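To make the arithmetic concrete, here is a minimal sketch of merging and absolute comparison. It is not how Parca or Polar Signals Cloud implement it: representing a profile as a plain dictionary of function name to cumulative samples, the `merge` and `absolute_diff` helpers, and the sample numbers are all assumptions made for illustration.

```python
from collections import Counter


def merge(profiles):
    """Merge delta profiles by summing the value of each function."""
    merged = Counter()
    for profile in profiles:
        merged.update(profile)  # adds the counts key by key
    return dict(merged)


def absolute_diff(base, compare):
    """Per-function difference (compare minus base) without any scaling."""
    names = base.keys() | compare.keys()
    return {name: compare.get(name, 0) - base.get(name, 0) for name in names}


# Hypothetical 10-second CPU profile of a program under steady load.
ten_seconds = {"main": 100, "func1": 60, "func2": 40}

five_minutes = merge([ten_seconds] * 30)   # 30 x 10s = 5 minutes
one_hour = merge([ten_seconds] * 360)      # 360 x 10s = 60 minutes

# The one-hour profile is 12 times bigger although the program did not change.
ratio = one_hour["main"] / five_minutes["main"]

# Every function looks like a large improvement in the absolute diff.
diff = absolute_diff(base=one_hour, compare=five_minutes)
```

In this sketch every entry of `diff` is negative, even though the program behaved identically in both time ranges. The difference only reflects how much time each merged profile covers.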

Animation explaining absolute profile comparison

Compared to the base, the functions main, func1, and func2 have all been seen more often. Only func3 hasn't been seen at all.

Absolute comparison of a one-hour and five-minute parca-agent merged profile

Notice that `runtime.goexit` is green. Our compare profile was the five-minute profile, which has fewer overall samples than the one-hour base profile. All this tells us is that fewer resources were used within five minutes than during one hour.

This could be more helpful. We want to compare the profiles, regardless of whether they represent five minutes or five days.

So what's new?

Relative Comparisons

Polar Signals Cloud and Parca (it's in the next release) now support relative comparisons. Relative comparisons are the default for delta profiles (usually CPU samples and CPU nanoseconds).

The profile with the smaller cumulative value at the root gets scaled up so that its root has exactly the same cumulative value as the root of the larger profile.

Imagine a CPU samples profile with 100 samples and another one with 150 samples. All of the values in the profile with 100 samples get multiplied by `1.5`, because `150/100 = 1.5`.
When comparing the profiles now, we know that the cumulative value at the root is always exactly the same, so any difference further down shows how each part of the profile changed relative to the whole.
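The scaling step can be sketched in a few lines. This is an illustration under assumptions, not the actual Parca implementation: profiles are plain dictionaries of name to cumulative samples, the `"root"` key and the helper names `scale_to_larger_root` and `relative_diff` are hypothetical, and the per-function numbers are made up around the 100 and 150 sample example.

```python
def scale_to_larger_root(a, b, root="root"):
    """Scale the profile with the smaller root value up to the larger root."""
    if a[root] == 0 or b[root] == 0:
        return a, b  # nothing sensible to scale against
    if a[root] < b[root]:
        factor = b[root] / a[root]
        return {name: value * factor for name, value in a.items()}, b
    factor = a[root] / b[root]
    return a, {name: value * factor for name, value in b.items()}


def relative_diff(base, compare, root="root"):
    """Per-function difference (compare minus base) after equalizing the roots."""
    base, compare = scale_to_larger_root(base, compare, root)
    names = base.keys() | compare.keys()
    return {name: compare.get(name, 0) - base.get(name, 0) for name in names}


# Hypothetical profiles: 100 samples in the base, 150 in the compare profile.
base = {"root": 100, "func1": 70, "func2": 30}
compare = {"root": 150, "func1": 90, "func2": 60}

# The base is multiplied by 150/100 = 1.5 before the difference is taken.
diff = relative_diff(base, compare)
```

After scaling, the root difference is zero. `func1` comes out negative and `func2` positive, meaning `func1` takes a smaller share of the total than before and `func2` a larger one, regardless of how many samples each profile contained.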

Animation explaining relative profile comparison

With relative profile comparison, we first check which profile has the higher value for main. All values of the profile with the smaller value get multiplied by the ratio between the two profiles. After that, the comparison itself works the same way as before.

Relative comparison of a one-hour and five-minute parca-agent merged profile

Note that the root is blue, which means that the roots of the one-hour and five-minute profiles have the same cumulative value. The five-minute merged profile has been multiplied by a factor of roughly 12 to make the relative comparison possible.

We can now see that most of the parca-agent has used less CPU (green), whereas other parts (red) and sub-systems have been using more CPU.

Overall, beneath the `runtime.goexit` we can read from the tooltip that the system has used 3% in the last five minutes compared to the full last hour.

Final Thoughts

It's now possible to better understand the relationship between profiles, regardless of the time range or the profiling frequency.
It's also easier to see where resources are spent.

We hope this will help you make the best use of your profiling data to get better insights. We are always eager to simplify your continuous profiling journey with Polar Signals Cloud (14-day free trial), and if you have any feedback about this feature, we would be happy to hear about it in our Discord community.
