Continuous, fleet-wide profilers, such as Parca Agent, are designed to run safely in production. This requirement brings a set of unique challenges that have informed our design tradeoffs. For example, we want to minimize the profiler's resource usage so that it affects the workloads we are observing as little as possible. At the same time, running continuously means that large amounts of data may be collected and stored. Despite the vast amount of data produced and the high cardinality of some of the metadata, we must ensure fast and efficient querying as well as responsive visualizations.
Before diving into some of the key design features of our continuous profiling project, let's lay out the basics of profiling and why we believe it's important.
Performance matters
Profilers are essential tools for developing software because they help us understand and solve performance problems. There are three broad categories of performance issues: "Known Knowns", "Known Unknowns", and "Unknown Unknowns".
Known Knowns are straightforward and can be easily accounted for. Known Unknowns are potential performance problems that can be planned for, but the exact size of the issue is unknown until it is tested.
Unknown Unknowns are unexpected performance problems that can only be discovered with the help of specialized tooling, such as profilers.
There's no other environment like production
Understanding the performance characteristics of software is not an easy task. There are many variables that can affect how our software behaves, including: heterogeneous hardware and software, such as different CPU models or operating systems; surrounding workloads that can cause priority inversions; cold caches; and data that is not representative (quadratic algorithms with very small test inputs might be fast enough to go unnoticed).
Profiling locally therefore gives us only a narrow view of how effectively a piece of software utilizes the underlying hardware.
What is continuous profiling?
Continuous profiling refers to the process of collecting performance data from a program on an ongoing basis. By combining the sampling approach with continuously taking profiles, we can get a more complete picture of the execution. Continuous profilers gather sampled profiling data frequently enough that, over time, it becomes statistically significant.
Much like any other observability data, you never know at which point you will need this data. We believe that collecting as much data as possible and then slicing and dicing it yields the highest visibility into our systems.
Why continuously profile?
There are more reasons, but some of the most common are:
- Understanding changes in performance: Always collecting data from every process and the kernel allows us to compare how the execution of code differed over time, across processes, or even across versions of the code. Parca's powerful multi-dimensional model allows comparing profiling data on any label dimension.
- Reducing costs: Insight into which code causes the most resources to be used allows engineers to work on targeted optimizations and be confident that resource usage will be lower after optimizing.
- Helping troubleshoot incidents: While fixing production incidents, performance data can be a lifesaver for understanding whether some system is struggling due to a lack of resources or whether the problem lies elsewhere. Historical data also helps us understand previous incidents to ensure that they won't happen again.
Design goals
We set the following goals to drive our project:
- Low-overhead profiling
- Profiling-oriented data model, including support for large, multi-dimensional analytical queries
- Visualizations and a query language that help us understand profiling data
- Automatic target and metadata discovery
Open standards: Supporting open wire formats
Parca Agent uses the pprof wire format, an open format for profiling data. This allows us to integrate with existing profilers and tools easily.
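To give a feel for the format, here is a minimal sketch in Go of constructing and serializing a profile with the github.com/google/pprof/profile package; the function name, sample value, and period are made up for illustration:

```go
package main

import (
	"os"

	"github.com/google/pprof/profile"
)

func main() {
	// Describe what each sample value means: CPU samples counted at 100Hz.
	p := &profile.Profile{
		SampleType: []*profile.ValueType{
			{Type: "samples", Unit: "count"},
		},
		PeriodType: &profile.ValueType{Type: "cpu", Unit: "nanoseconds"},
		Period:     10_000_000, // 100Hz => one sample every 10ms.
	}

	// A single (made-up) function, location, and sample.
	fn := &profile.Function{ID: 1, Name: "main.compute", Filename: "main.go"}
	loc := &profile.Location{ID: 1, Line: []profile.Line{{Function: fn, Line: 42}}}
	p.Function = append(p.Function, fn)
	p.Location = append(p.Location, loc)
	p.Sample = append(p.Sample, &profile.Sample{
		Location: []*profile.Location{loc},
		Value:    []int64{27}, // this stack was seen 27 times in the window.
	})

	// Write the gzip-compressed protobuf wire format to a file.
	f, err := os.Create("profile.pb.gz")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	if err := p.Write(f); err != nil {
		panic(err)
	}
}
```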
Moreover, we actively participate in the OpenTelemetry profiling discussions, which aim to standardize the wire format for profiling data. This will allow us to integrate with other tools and profilers that use the OpenTelemetry profiling format.
Storage: Columnar datastore
Since we wanted to support infrastructure-wide profiling, we needed a storage system that could scale to even the largest infrastructures. However, we also wanted to give users the ability to query not only the labels of their infrastructure but also specific functions. These requirements informed our decision to build a columnar datastore that uses the Apache Arrow and Parquet projects as its columnar data formats.
In a columnar storage system, data is organized into columns rather than rows, meaning that each column contains all of the values for a specific attribute. Columnar storage can provide a number of benefits over traditional row-based storage systems. One advantage is that analytical queries and aggregations become easier and faster, because the data is already organized so that a specific attribute can be accessed without reading entire rows.
Additionally, because columnar storage systems typically compress the data to save space, they can be more efficient in terms of storage and can reduce the amount of data that needs to be read and processed. The columnar format allows us to handle arbitrary cardinality and to perform effective searches on any of the columns in the database, which was a key requirement for us.
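As a simplified illustration of the difference (a conceptual sketch only, not how our storage engine is implemented), compare a row-oriented layout of profiling samples with a column-oriented one:

```go
package main

import "fmt"

// Row-oriented: each sample is stored as one record with all of its attributes.
type SampleRow struct {
	Timestamp    int64
	PodName      string
	FunctionName string
	CPUSamples   int64
}

// Column-oriented: each attribute is stored contiguously in its own column.
// Scanning or aggregating a single column (e.g. CPUSamples) touches only that
// slice, and repetitive columns (e.g. PodName) compress very well.
type SampleColumns struct {
	Timestamp    []int64
	PodName      []string
	FunctionName []string
	CPUSamples   []int64
}

func (c *SampleColumns) TotalCPUSamples() int64 {
	var total int64
	for _, v := range c.CPUSamples { // only one column is read for this query.
		total += v
	}
	return total
}

func main() {
	cols := SampleColumns{
		Timestamp:    []int64{1000, 1010, 1020},
		PodName:      []string{"api-0", "api-0", "api-1"},
		FunctionName: []string{"main.compute", "runtime.mallocgc", "main.compute"},
		CPUSamples:   []int64{27, 13, 41},
	}
	fmt.Println("total CPU samples:", cols.TotalCPUSamples())
}
```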
Check our blog post on the topic: A database for Observability.
Better UX and UI: Scalable Visualizations
UX is key to making sense of any type of data. Visualizations, in particular, represent a large amount of information in a more digestible way, and observability and profiling data are no exception.
We wanted to ensure that we provide a great user experience for our users. Our UI features several options for viewing profiling data: the metrics graph, flame graph, optimized table view, and callgraph (experimental). On top of this, users can dig deeper by searching and filtering what is visualized by function name, and we've linked these visualizations together so that a user interaction highlighting a portion of one graph is also reflected in the other visualizations.
But in order for visualizations to be truly effective, it is essential that they are not only visually pleasing but also performant. We are continuously working towards matching aesthetics with performance in our visualizations of profiling data. In order to make informed decisions and invest our efforts where needed, we record performance benchmarks and introduce changes with a particular eye on how they affect those benchmarks. For example, introducing virtualization for our Table component in the Profile Explorer significantly improved performance and mitigated main-thread blocking even for large data merges.
While working on our callgraph, we're also paying close attention to performance. The callgraph has been particularly challenging due to the complexity of the layout algorithm needed to render the series of calls taken to execute a given program: the graph is directed, potentially cyclical, contains large amounts of data, and should ideally have minimal link and node overlap. Our experimental callgraph (currently hidden behind a feature flag) has already gone through many iterations, and we've seen significant improvements. For example, we created an initial version of the callgraph in SVG, but have since moved to rendering in Canvas due to its performance benefits.
Beyond reducing the rendering time of each of our major visualizations, we are also actively using browser dev tools to reduce unnecessary re-renders of components: the other half of the frontend performance battle.
Low-overhead profiling
Leveraging BPF
BPF (short for extended Berkeley Packet Filter, although the "e" is now dropped) is a runtime built into the Linux kernel. It allows us to write and run programs inside the kernel, safely.
These programs can be used for a variety of purposes, including network traffic filtering, tracing, profiling, and even infrared decoding!
We greatly leverage BPF to collect stack traces from the kernel as well as userspace without having to modify or manually instrument the application code in any way.
Sampling
We attach our BPF programs to a perf event that fires 100 times every second (100Hz). We collect samples periodically rather than, for example, on every executed instruction; this is what makes our profiler a sampling profiler. The reason many profilers operate this way is to reduce overhead, but the price to pay is that some samples will be missed. This is not a problem if the profiler runs for a long enough timeframe: most of the sufficiently hot code paths will show up.
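As a rough sketch of what this looks like, here is how a BPF program of type BPF_PROG_TYPE_PERF_EVENT could be attached to a 100Hz software CPU-clock perf event on a single CPU using golang.org/x/sys/unix and github.com/cilium/ebpf; the object file and program names are placeholders, not the agent's actual ones:

```go
package main

import (
	"log"

	"github.com/cilium/ebpf"
	"golang.org/x/sys/unix"
)

func main() {
	// Load a pre-compiled BPF object; "cpu_profiler.o" and "do_sample" are
	// placeholder names used only for this sketch.
	coll, err := ebpf.LoadCollection("cpu_profiler.o")
	if err != nil {
		log.Fatal(err)
	}
	defer coll.Close()
	prog := coll.Programs["do_sample"]

	// Open a software CPU-clock perf event that fires 100 times per second.
	attr := unix.PerfEventAttr{
		Type:   unix.PERF_TYPE_SOFTWARE,
		Config: unix.PERF_COUNT_SW_CPU_CLOCK,
		Sample: 100, // interpreted as a frequency because of PerfBitFreq.
		Bits:   unix.PerfBitFreq,
	}
	// pid=-1, cpu=0: profile every process, but only on CPU 0 in this sketch;
	// a real agent opens one event per online CPU.
	fd, err := unix.PerfEventOpen(&attr, -1, 0, -1, unix.PERF_FLAG_FD_CLOEXEC)
	if err != nil {
		log.Fatal(err)
	}
	defer unix.Close(fd)

	// Attach the BPF program to the perf event and enable it.
	if err := unix.IoctlSetInt(fd, unix.PERF_EVENT_IOC_SET_BPF, prog.FD()); err != nil {
		log.Fatal(err)
	}
	if err := unix.IoctlSetInt(fd, unix.PERF_EVENT_IOC_ENABLE, 0); err != nil {
		log.Fatal(err)
	}

	select {} // keep running; samples are now being collected in the kernel.
}
```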
In-kernel stacks aggregation
Any profiler wants to minimize its overhead, but keeping the footprint as small as possible is even more pressing when running continuously.
One of the ways in which we make our profiler more efficient is by not sending every sample from our BPF program, running in kernel space, to the Go application in userspace. Instead, we use a BPF map, an in-kernel data structure, to aggregate the samples. Every 10 seconds, we read these samples in userspace, produce a profile, and send it to the server.
This map uses the hashes of the user and kernel stack traces as the key of our aggregated stack, and its value is the number of times we've seen that stack.
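A hedged sketch of the userspace side of this aggregation, iterating a counts map with github.com/cilium/ebpf; the exact key layout and map path shown here are assumptions for illustration, not the agent's actual definitions:

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/cilium/ebpf"
)

// stackCountKey mirrors the key this sketch assumes the BPF program uses:
// the PID plus the hashes of the user and kernel stacks.
type stackCountKey struct {
	PID             uint32
	_               uint32 // padding to match kernel-side struct alignment.
	UserStackHash   uint64
	KernelStackHash uint64
}

func readCounts(counts *ebpf.Map) {
	var (
		key   stackCountKey
		count uint64
	)
	it := counts.Iterate()
	for it.Next(&key, &count) {
		fmt.Printf("pid=%d user=%#x kernel=%#x seen=%d times\n",
			key.PID, key.UserStackHash, key.KernelStackHash, count)
	}
	if err := it.Err(); err != nil {
		log.Println("iterating counts map:", err)
	}
}

func main() {
	// Placeholder: in a real agent the map comes from the loaded BPF
	// collection rather than a pinned path.
	counts, err := ebpf.LoadPinnedMap("/sys/fs/bpf/stack_counts", nil)
	if err != nil {
		log.Fatal(err)
	}
	defer counts.Close()

	// Every 10 seconds, drain the aggregated samples and build a profile.
	for range time.Tick(10 * time.Second) {
		readCounts(counts)
	}
}
```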
Walking stacks in BPF
The lack of frame pointers in many native userspace applications makes walking native stacks hard. We've implemented a custom stack walker that runs in BPF rather than in userspace. This allows us to significantly reduce overhead, as we don't need to copy the whole process stack from the kernel to userspace. Additionally, it increases security, as the stack might contain confidential information that we want to avoid handling outside of its process.
You can read more about this in our recent blogpost.
BPF enables us to easily walk kernel stacks too, providing a greater level of visibility into our computing stack.
Remote, fully asynchronous symbolization
Symbolization is the process of mapping low-level machine information, such as memory addresses, to high-level, human-readable symbols such as function names. Unfortunately, it is an expensive and surprisingly complex task.
By symbolizing asynchronously, we improve the performance of the collection and ingestion process. We don't require the debug information to be present when the profiles are sent; it can be added at a later time, and the profiles will then be symbolized. We also symbolize remotely, on the server, to free precious resources on the machines where profiles are collected.
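To illustrate the mapping step itself, here is a simplified sketch that resolves an address against an ELF symbol table using Go's standard debug/elf package; it assumes an unstripped binary, while real symbolization also deals with DWARF, line numbers, and inlined functions:

```go
package main

import (
	"debug/elf"
	"fmt"
	"log"
	"os"
)

// symbolize returns the name of the function whose symbol covers addr,
// by scanning the ELF symbol table.
func symbolize(path string, addr uint64) (string, error) {
	f, err := elf.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	syms, err := f.Symbols()
	if err != nil {
		return "", err
	}
	for _, s := range syms {
		if s.Value <= addr && addr < s.Value+s.Size {
			return s.Name, nil
		}
	}
	return "", fmt.Errorf("no symbol covers address 0x%x", addr)
}

func main() {
	if len(os.Args) != 3 {
		log.Fatal("usage: symbolize <binary> <hex-address>")
	}
	var addr uint64
	fmt.Sscanf(os.Args[2], "0x%x", &addr)
	name, err := symbolize(os.Args[1], addr)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(name)
}
```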
The debug information can be extracted and sent to the Parca Server, to allow for stripped binaries in production. If the executable is publicly accessible, the server will retrieve the debug information from a public debuginfod server. This works for binaries and libraries installed from any major Linux distribution's packages.
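The debuginfod HTTP API is straightforward: a client requests /buildid/&lt;build-id&gt;/debuginfo and receives the debug information for that build ID. A minimal sketch of such a lookup (the build ID below is a placeholder):

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"os"
)

func fetchDebuginfo(server, buildID, outPath string) error {
	// debuginfod serves debug information under /buildid/<build-id>/debuginfo.
	url := fmt.Sprintf("%s/buildid/%s/debuginfo", server, buildID)
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("debuginfod returned %s for %s", resp.Status, url)
	}

	out, err := os.Create(outPath)
	if err != nil {
		return err
	}
	defer out.Close()
	_, err = io.Copy(out, resp.Body)
	return err
}

func main() {
	// "deadbeefcafe" is a placeholder; real build IDs come from the ELF
	// note section of the profiled executable.
	err := fetchDebuginfo("https://debuginfod.elfutils.org", "deadbeefcafe", "debuginfo.elf")
	if err != nil {
		log.Fatal(err)
	}
}
```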
All of the above lifts massive responsibilities off the agent, making it even more lightweight.
Enriched metadata: context is key
To aid in debugging performance, we augment the profiles with additional metadata, such as thread names, compiler versions, kernel versions, and a lot more environmental context.
We also attach Kubernetes and systemd metadata to allow for a richer querying experience.
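Conceptually, this metadata becomes extra label dimensions on the profiling data, so it can be used for filtering and grouping at query time. A small, purely illustrative sketch of such enrichment; the label names and the way the metadata is discovered are hypothetical:

```go
package main

import "fmt"

// enrichLabels merges environment metadata into a profile's label set.
// Discovering the metadata itself (container runtime, kubelet API,
// systemd, /proc, ...) is out of scope for this sketch.
func enrichLabels(base, kubernetes, systemd map[string]string) map[string]string {
	out := make(map[string]string, len(base)+len(kubernetes)+len(systemd))
	for k, v := range base {
		out[k] = v
	}
	for k, v := range kubernetes {
		out[k] = v
	}
	for k, v := range systemd {
		out[k] = v
	}
	return out
}

func main() {
	labels := enrichLabels(
		map[string]string{"node": "worker-1", "comm": "api-server"},
		map[string]string{"namespace": "default", "pod": "api-0", "container": "api"},
		map[string]string{"systemd_unit": "kubelet.service"},
	)
	fmt.Println(labels)
}
```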
Conclusion
Developing a production-ready continuous profiler requires work across many layers of the stack and a wide range of considerations.
We hope this blog post gave you a glimpse of some of the design decisions we've made compared to traditional profilers. If you're interested in learning more, please check out our GitHub repositories: Parca and Parca Agent.
We are actively working on improving the Parca project, which is open source (we use the Apache License 2.0 for the userspace programs and GPLv2 for the BPF ones).
Notes and references
- Google-Wide Profiling: A Continuous Profiling Infrastructure For Data Centers
- Introducing debuginfod, the elfutils debuginfo server
- pprof++: A Go Profiler with Hardware Performance Monitoring
- Golang: Proposal: hardware performance counters for CPU profiling
- Continuous Profiling in Production: What, Why and How