Recently, I have been toying with a random pixel art generator, a Next.js app that creates pixel art from random numbers generated with a user-provided seed value. It generates an array of random numbers from the given seed, converts each of those numbers to a color (R/G/B), and renders the resulting pixel art image for the user.
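As a rough sketch of that idea (illustrative only, not the app's actual code; the function and variable names here are made up), each random number in `[0, 1)` can be mapped to a 24-bit RGB color:

```js
// Illustrative sketch: seed -> deterministic random numbers -> RGB colors.
import seedrandom from 'seedrandom';

function randomColors(seed, count) {
  const rnd = seedrandom(seed); // same seed => same sequence of numbers
  const colors = [];
  for (let i = 0; i < count; i++) {
    const n = Math.floor(rnd() * 0x1000000); // map [0, 1) to 0..0xFFFFFF
    colors.push({ r: (n >> 16) & 0xff, g: (n >> 8) & 0xff, b: n & 0xff });
  }
  return colors;
}
```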
In case you are interested, you can play around with the Vercel deployment here: https://random-pixel-art-generator.vercel.app/ :)
Internal Kubernetes deployment
Before the public Vercel deployment, I had this app deployed on our internal Kubernetes cluster, which came with a set of useful features by default:
- Prometheus monitoring for the app pods and service metrics
- Grafana dashboard for metrics visualization
- Alerts when metrics cross defined thresholds
- And, of course, Parca for continuous profiling of all app processes
Background on profiling Node.js/Next.js apps with Parca
One of the best things about using Parca for profiling your Node.js/Next.js apps is that you don't need to make any changes to your application code; you don't need to integrate any new SDKs into your app. Instead, you just need to pass the `--perf-basic-prof` flag to the `node` command when running your app. For example:
node --perf-basic-prof src/main.js
or
"NODE_OPTIONS='--perf-basic-prof' next start"
This flag instructs `node` to emit symbol mappings for the JIT-compiled code frames in memory, so that Parca Agent (our eBPF-based profiler) can use those mappings to symbolize the profile data and present a meaningful flame graph to the user. If you are curious about how Parca Agent's eBPF profiling works, please check out this article.
Note: This flag has negligible impact on the application's performance; it is a low-overhead way to get symbol mappings out of `node`.
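Concretely, V8 writes these mappings to a plain-text file at `/tmp/perf-<pid>.map`, with one `START_ADDRESS SIZE SYMBOL_NAME` line per JIT-compiled code object. An illustrative (made-up) entry might look like this:

```
3ef414082000 3e8 LazyCompile:~handler /app/src/pages/api/random-pixels.js:2
```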
Okay, so I deployed it on the Kubernetes cluster, shared the link with a group of people, and everything was good, until this happened:
Got paged for a spike in the Pixel Generator app API calls.
Triaging the issue
I got paged because the pixel generator API's 99th-percentile response time exceeded one second. Looking at the metrics dashboard, it was clear that CPU usage on the Next.js app pod was high and that the CPU spikes lined up with the 99th-percentile response time spikes.
Before we go further in triaging the issue, let's take a look at our Next.js app's `/api/random-pixels` endpoint that is used to generate the pixel art:
```js
// src/pages/api/random-pixels.js
import seedrandom from 'seedrandom';

export default function handler(req, res) {
  const {seed, limit} = req.query;
  if (limit > 40000) {
    throw new Error('limit must be less than 40000');
  }
  let results = [];
  const rnd = seedrandom(seed); // seeded PRNG: same seed => same pixels
  for (let i = 0; i < limit; i++) {
    const newRnd = rnd();
    results = [...results, newRnd];
  }
  res.status(200).json({results});
}
```
The handler function is pretty simple: it generates `limit` random numbers from the seed and returns them to the client. It also checks that `limit` is not more than `40000`, so the server is not impacted by requests with unreasonably large values.
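For reference, a call to this endpoint might look like this (the seed and limit values here are made up):

```js
// Hypothetical client-side call; run inside an async context.
const res = await fetch('/api/random-pixels?seed=parca&limit=5');
const {results} = await res.json(); // => five numbers in [0, 1)
```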
To get a better idea of what was happening on the pod, let's go to the Parca dashboard and check the profile data for the Next.js app during the time when the response time spiked. Remember, all the apps in the Kubernetes cluster were continuously profiled by Parca, so we can just go back in time and see what happened with a few clicks, yay!
Finding the root cause with Parca's profile data
Once on the Parca dashboard, let's go straight to the problem by filtering the profile data for the following:
- Add the filter `container="random-pixel-art-generator-nextjs-app"` to get the profiles from the Next.js app pod.
- Enter the required time range during which the response time spike occurred.
Profile Explorer with filtered profiles
Now we get the required profile data for our Next.js app. Since it is a CPU usage issue that we are trying to triage, let's merge all profile data using the "Merge" button.
Here is the screenshot of the merged flame graph:
Screenshot highlighting the function calls with high CPU usage
Paying the JavaScript Immutability Tax
Looking at the flame graph above, we can see that 90% of the `handler` function's total CPU time is spent executing the `Builtin:GrowFastDoubleElements` and `Builtin:IterableToListWithSymbolLookup` functions. From the names, it is pretty clear that these builtins are related to cloning an array, and more specifically that they are the result of the array spread operation in the handler's loop body: `results = [...results, newRnd]`.
The JavaScript spread operator is one of the language's most useful features, and it is so ubiquitous nowadays that we tend to misuse it. Even though the spread operator comes in handy for our immutability needs, it carries a performance overhead: each spread copies the entire array, so spreading inside a loop does quadratic work overall. That cost is not noticeable on small arrays, but on large ones (say, more than 25,000 elements or so) it is. That is the price we pay for data immutability in JavaScript.
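A quick micro-benchmark (mine, not taken from the profiles above; absolute numbers will vary by machine and Node.js version) makes the quadratic cost visible:

```js
// Compare building an n-element array with spread-per-iteration vs. push().
function buildWithSpread(n) {
  let results = [];
  for (let i = 0; i < n; i++) {
    results = [...results, Math.random()]; // copies the whole array every iteration
  }
  return results;
}

function buildWithPush(n) {
  const results = [];
  for (let i = 0; i < n; i++) {
    results.push(Math.random()); // amortized O(1) append
  }
  return results;
}

for (const [name, build] of [['spread', buildWithSpread], ['push', buildWithPush]]) {
  const start = process.hrtime.bigint();
  build(25000);
  const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
  console.log(`${name}: ${elapsedMs.toFixed(1)} ms`);
}
```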
But in our code above, we don't actually need immutability for this array, since the variable is used purely within the handler's local scope. So, to get rid of the unnecessary CPU time spent cloning the array on every loop iteration, we can drop the spread operator and use `Array.prototype.push()` instead.
The resulting code would look like this:
```js
// src/pages/api/random-pixels.js
import seedrandom from 'seedrandom';

export default function handler(req, res) {
  const {seed, limit} = req.query;
  if (limit > 40000) {
    throw new Error('limit must be less than 40000');
  }
  const results = []; // no longer reassigned, so it can be const
  const rnd = seedrandom(seed);
  for (let i = 0; i < limit; i++) {
    const newRnd = rnd();
    results.push(newRnd); // append in place instead of copying the array
  }
  res.status(200).json({results});
}
```
Validation
Once the fix was deployed, I tested the API endpoint with `limit` values close to `20000` and observed response times under one second.
We can best validate this with Parca's profile comparison feature, which works like `git diff` but for profile data. To use it:
- Click on the "Compare" button on the Parca dashboard. The profile explorer now splits into two panes.
- On the left, select the time range during which we had the response time spike.
- On the right, select the time range during which we ran the test requests with high `limit` values after the fix.
Here is what we get:
Comparison of the profiles before and after the fix
In the comparison, green means less CPU time, and the darker the green, the bigger the improvement. Hovering over parts of the profile diff, we can see close to a 90% reduction in CPU usage.
Diff that highlights 90% reduction in the CPU usage
It is always good to see that your fix has worked! 🎉
Do you also run Next.js/Node.js apps on Kubernetes? Then you should definitely look into Parca for continuously profiling your apps, if you haven't already. It can help you understand your app's performance and find the root cause of performance bottlenecks with just a few clicks.
If you have any questions on how to use Parca for profiling your Next.js/Node.js apps, please come join our Discord community server. You can also star the Parca project on GitHub, and contributions are welcome!