A Walk with LuaJIT

How we scrape callstack information from the LuaJIT engine for profiling

November 13, 2024

The following is a chronicle of implementing a general purpose, zero-instrumentation, BPF-based profiler for LuaJIT. Some assumptions are made about what this entails, and it may be helpful to read some of our other work in this area. One major change from prior efforts is that instead of working with the original Parca unwinder we are now working with the OpenTelemetry eBPF profiler. If you missed that announcement you can read about it here. The TL;DR is that a tiny BPF program scrapes the minimal information needed from the stack and passes it off to a userland program that fleshes out all the required scaffolding and does the metadata extraction.

Lua is a dynamic, general purpose scripting language that's been around since the 90s. Lua is very small: by way of comparison, a stripped Lua binary is only ~230kb on AMD64, while a recent version of Python is over 5mb! Lua is commonly used for game scripting, but you'll also find it in text editors like Lite and Howl and even commercial image editing programs like Adobe Lightroom (Classic). Another example is Redis/Valkey, which supports Lua scripting as an escape valve for keeping the database simple yet extensible ("hey, I want feature XYZ"; "no way buddy, use Lua!"): if some niche feature can be done via a Lua script, it doesn't need to be built in. Similar to how a browser uses JavaScript and Emacs uses LISP, these programs use Lua as a dynamic plugin framework so that they can be extended and customized in a much easier and safer way than using C/C++. But that's Lua; we're here to talk about LuaJIT!

Polar Signals' interest in LuaJIT stems from its use in the OpenResty web framework, which combines Nginx and LuaJIT with a suite of modules for connecting to various back end systems and doing general dynamic web application work. You may already be using LuaJIT and OpenResty without knowing it if you use the Apache APISIX API gateway. LuaJIT came on the scene in 2005 when Mike Pall released the first version. This was back in the dynamic language interpreter dark ages, before v8, SpiderMonkey, SquirrelFish, Tamarin et al., when there was some question about whether a JIT for dynamic languages made any sense. Java had shown JITs can do amazing things for static languages, but what's the point of generating code to use an add instruction to add two integers when you have to dedicate hundreds of instructions to hash table lookups and type wrangling just to get the two integers into registers in the first place? Turns out you can get rid of a lot of dynamic variable lookups, make educated guesses about types, have fallback code handle the edge cases, and get massive speedups. LuaJIT also has a high performance interpreter, and it sports an extremely powerful FFI layer. LuaJIT can be 6 to 20 times faster than Lua on simple benchmarks; performance is probably close to what you'd get with a modern JavaScript engine. However, the most unique thing about LuaJIT is the tracing JIT.

This one goes to 0xB

LuaJIT isn't a traditional (i.e. method based) JIT like you'll find in the JVM or CLR. A trace based JIT has machinery to detect frequently repeated loops; when one is detected it starts to record instructions (by storing the instructions executed in a side buffer), and when the loop completes it JITs that "trace" of instructions and jumps to the JIT'd code instead of interpreting. The trace can traverse many functions, and all the branches not taken are left out of the trace. The trick is that the trace has to make sure that if any of the conditions on the hot path change, it exits the trace and falls back to the interpreter. This basically boils down to aggressive inlining of only the parts of the code that actually get executed, unlike traditional inlining where whole methods are lifted. Frequently in low level languages you'll see code like this:

func doit() {
    if common_case {
        // handle common case inline
    } else {
        doit_general_slow()
    }
}

We avoid function call overhead on the common case and fallback to the slow path otherwise (since doit is a small function it would typically be inlined in a static native language). Tracing JITs generalize this concept to an arbitrary set of "common_case" conditions and arbitrary amount of code which can be hoisted into the fast path. This is a level of optimization that method-based JITs can't really match, even if they employ aggressive inlining they will always pay a higher price in instruction cache penalties because they can't jettison the code for branches not taken. Tracing JITs spend all their time on code that matters so your JIT dollar goes much further. So what's the catch?

Trace based JITs aren't very common; LuaJIT might be the only one to ever get widespread adoption. The catch is a problem called "trace explosion". If your program has many different hot loops, the JIT will eat a lot of time and space JIT'ing an immense variety of paths through your code, which can exhaust the memory resources of the target machine. Think about recursion, many-armed switch statements, and for loops with lots of nested if/else clauses. One trick is to limit the size of traces and have the ability to stitch traces together: what may start out as one trace with many side exits to the interpreter can, over time, have those side exits patched to jump to new traces as new hot loops are detected. Lots more heuristics can be used to keep the trace "space" from blowing up, but at the end of the day, having a fast interpreter to fall back on, a very fast JIT, and the ability to efficiently garbage collect old cold traces is very important. Method based JITs don't have this problem: they JIT a function once and it's done, so they typically don't have to worry about code memory consumption. Anyways, great effort was put into trace based JITs for other languages that never got much traction, see this paper and this one for some gory details. Adobe even tried to build one for ActionScript that went the way of the dodo bird. Tracing JITs may not have gone mainstream, but a lot of the techniques discovered by tracing JIT research live on in today's JavaScript engines, and tracing JITs will go down in history as helping to dispel the idea that JITing dynamic languages isn't worth it. Maybe tracing JITs just need more context about what code is important to optimize, like profiling information...

Oh wait, that's what we're here to talk about! So, how do we profile LuaJIT programs? Specifically, how do we implement a zero-instrumentation profiler that works with any version of LuaJIT one may find in the myriad versions of OpenResty that are in common use? Basically we just have to walk the native stack, detect where native code jumps into Lua code, walk the Lua stack to find all the Lua call frames, and report them in the context of the native call frames above and below. Nothing to it!

The obvious first step is to ask your favorite AI agent "How do I write an eBPF unwinder for LuaJIT?". Joking, but I actually did this, and the answer was kinda like a politician talking about science: a couple little nuggets of sense surrounded by reams of nonsense. The real first thing to do is examine all prior art and see what we can learn. The main ones that came to mind are perf, poor man's profiling (i.e. gdb sampling), and LuaJIT's own built-in profiler.

First up is perf, the trusty tool already built into Linux that aims to be able to profile anything. And here's what you get:

Samples: 10K of event 'cpu_core/cycles:P/', Event count (approx.): 8554628639
Children Self Command Shared Object Symbol
- 31.64% 31.64% luajit [JIT] tid 3107635 [.] TRACE_3::/home/tpr/projects/lua/deep.lua:3
TRACE_3::/home/tpr/projects/lua/deep.lua:3
- 28.39% 28.37% luajit [JIT] tid 3107635 [.] TRACE_17::/home/tpr/projects/lua/deep.lua:7
TRACE_17::/home/tpr/projects/lua/deep.lua:7
- 6.54% 6.52% luajit [JIT] tid 3107635 [.] TRACE_12::/home/tpr/projects/lua/deep.lua:7
TRACE_12::/home/tpr/projects/lua/deep.lua:7
- 5.96% 5.95% luajit [JIT] tid 3107635 [.] TRACE_18::/home/tpr/projects/lua/deep.lua:7
TRACE_18::/home/tpr/projects/lua/deep.lua:7
- 4.54% 4.54% luajit [JIT] tid 3107635 [.] TRACE_5::/home/tpr/projects/lua/deep.lua:5
TRACE_5::/home/tpr/projects/lua/deep.lua:5
- 4.16% 4.16% luajit [JIT] tid 3107635 [.] TRACE_4::/home/tpr/projects/lua/deep.lua:7
TRACE_4::/home/tpr/projects/lua/deep.lua:7
- 3.52% 3.50% luajit [JIT] tid 3107635 [.] TRACE_19::/home/tpr/projects/lua/deep.lua:7
TRACE_19::/home/tpr/projects/lua/deep.lua:7
- 3.30% 3.30% luajit [JIT] tid 3107635 [.] TRACE_6::/home/tpr/projects/lua/deep.lua:7
TRACE_6::/home/tpr/projects/lua/deep.lua:7
- 2.86% 2.86% luajit [JIT] tid 3107635 [.] TRACE_13::/home/tpr/projects/lua/deep.lua:7
TRACE_13::/home/tpr/projects/lua/deep.lua:7
- 2.56% 2.56% luajit [JIT] tid 3107635 [.] TRACE_10::/home/tpr/projects/lua/deep.lua:5
TRACE_10::/home/tpr/projects/lua/deep.lua:5
- 1.98% 1.97% luajit [JIT] tid 3107635 [.] TRACE_14::/home/tpr/projects/lua/deep.lua:7
TRACE_14::/home/tpr/projects/lua/deep.lua:7
- 1.61% 1.61% luajit [JIT] tid 3107635 [.] TRACE_20::/home/tpr/projects/lua/deep.lua:7
TRACE_20::/home/tpr/projects/lua/deep.lua:7
- 0.97% 0.97% luajit [JIT] tid 3107635 [.] TRACE_8::/home/tpr/projects/lua/deep.lua:7
TRACE_8::/home/tpr/projects/lua/deep.lua:7
- 0.93% 0.93% luajit [JIT] tid 3107635 [.] TRACE_15::/home/tpr/projects/lua/deep.lua:7
TRACE_15::/home/tpr/projects/lua/deep.lua:7
0.48% 0.48% luajit [JIT] tid 3107635 [.] TRACE_21::/home/tpr/projects/lua/deep.lua:7
0.34% 0.34% luajit [JIT] tid 3107635 [.] TRACE_16::/home/tpr/projects/lua/deep.lua:7
0.16% 0.16% luajit [JIT] tid 3107635 [.] TRACE_11::/home/tpr/projects/lua/deep.lua:7
0.06% 0.00% luajit [unknown] [.] 0xffffffff8ce00f0b
0.06% 0.00% luajit [unknown] [.] 0xffffffff8cd207cb
0.06% 0.00% luajit [unknown] [.] 0xffffffff8bca04ef
0.06% 0.00% luajit [unknown] [.] 0xffffffff8bdee806
0.06% 0.00% luajit [unknown] [.] 0xffffffff8bded8bf
0.03% 0.00% luajit [unknown] [.] 0xffffffff8be05132
0.03% 0.00% luajit [unknown] [.] 0xffffffff8be050ea
0.03% 0.00% luajit [unknown] [.] 0xffffffff8be04f77
0.03% 0.00% luajit [unknown] [.] 0xffffffff8bdf2370
0.02% 0.02% luajit luajit [.] gc_onestep
0.02% 0.02% luajit luajit [.] lj_BC_TGETS
0.02% 0.00% luajit [unknown] [.] 0xffffffff8bf6da77
0.02% 0.00% luajit [unknown] [.] 0xffffffff8bf6d74b
0.02% 0.00% luajit [unknown] [.] 0xffffffff8bf78c4a

This is what perf output looks like if you compile with LUAJIT_USE_PERFTOOLS, which writes out a special file to help perf convert PC (program counter) values into file names/line numbers. Well, this isn't super helpful, because all we see is the top frame; perf doesn't show any call structure. Why is that? Because perf doesn't know how to unwind over LuaJIT JIT frames: LuaJIT doesn't emit frame pointers to make this possible. I suppose if you know your code well and your call stacks are simple, knowing the PC of the top frame could be helpful. But these days code bases can get complicated, and you often need a deeper understanding of "how did I get here" to know what your best optimization strategies are. Okay, so short of a time machine to go back 20 years and convince Mr. Pall that frame pointers are worth burning a register for, perf doesn't help us much.

Message from B. Gregg to M. Pall

And even if perf could navigate up to the next frame, it would just be the native frame that called the Lua interpreter, and it wouldn't unpack anything about where in the Lua application you were. In some languages like Go/Java/.Net each native frame roughly corresponds to a source language method frame, but Lua (like other interpreters) has its own stack separate from the native stack, so to profile Lua you need the ability to walk both the native frames and the frames on the Lua stack.

What about DWARF? We know how to unwind code with DWARF unwinding information, could that help? It turns out LuaJIT has "external" unwinder support which emits runtime DWARF information for the benefit of C++ exception handling, so one might think that could be used to walk a Lua stack. But this runtime unwinding feature is buried in arcane memory structures and is tricky to get at at runtime (not as easy as reading DWARF information from an executable's debug sections). Furthermore, that information is only available if LuaJIT is compiled with LUAJIT_UNWIND_EXTERNAL defined, which some configurations don't do; having to deal with C++ exceptions just isn't as run of the mill in Linux server environments as it is in Windows environments. If you want to see how LuaJIT does this (not for the faint of heart!) you could start here.

But wait, LuaJIT comes with a profiler! Surely that will be helpful. Helpful, but it turns out to not be exactly what we're looking for. LuaJIT's profiler throws away all the JIT code because its sampling technique is a cooperative one. Basically the Lua profiler is "hook" based and not interrupt based like eBPF samplers are. The profiler runs a timer in the background that just sets a bit, and profiling works by having the Lua interpreter inject hooks that check for this bit and only sample the stack when the bit is set. JIT code by default doesn't bother to check this bit, so when you enable profiling all JIT code has to be regenerated to inject these checks. That's a party foul when the goal is zero instrumentation! We want to profile the exact program that's running, all the time, with absolute fidelity to what's running sans profiling (well, as close as we can get anyways). Zero-instrumentation is important for production profiling!

So let's do what any sane person would do and poke around with GDB (yes, the venerable poor man's profiler is still a thing). In a lot of ways eBPF profilers are the new and improved poor man's profiler for today (rich man's profiler?). No instrumentation or tools required! So here's what stopping LuaJIT in the middle of a simple CPU-burning toy program looks like in gdb:

Program received signal SIGINT, Interrupt.
0x00005555679cfec1 in ?? ()
(gdb) bt
#0 0x00005555679cfec1 in ?? ()
#1 0x00007ffff7e6e890 in ?? ()
#2 0x0000000000000000 in ?? ()

Oh dear. GDB is confused. You get what you pay for, I guess. What is 0x5555679cfec1? Well, as we saw in perf above, it is JIT'd code. Here are the proc mappings:

0x555555554000 0x55555555c000 0x8000 0x0 r--p /home/tpr/src/luajit2/src/luajit
0x55555555c000 0x5555555c7000 0x6b000 0x8000 r-xp /home/tpr/src/luajit2/src/luajit
0x5555555c7000 0x5555555e0000 0x19000 0x73000 r--p /home/tpr/src/luajit2/src/luajit
0x5555555e0000 0x5555555e3000 0x3000 0x8b000 r--p /home/tpr/src/luajit2/src/luajit
0x5555555e3000 0x5555555e4000 0x1000 0x8e000 rw-p /home/tpr/src/luajit2/src/luajit
0x5555555e4000 0x555555605000 0x21000 0x0 rw-p [heap]
0x5555679c0000 0x5555679d0000 0x10000 0x0 r-xp

The last one is where gdb stopped our program. Not surprising: GDB got confused because, as we already learned, DWARF and frame pointer unwinding don't work. But in addition to external unwinding, LuaJIT has an extension to address this, LUAJIT_USE_GDBJIT. Unfortunately nobody turns this on in production, so it is only helpful for informational purposes. Here's what the crux of that looks like. Basically you teach GDB how to walk the trace frames using DWARF FDEs, just like GDB normally does with .eh_frame sections from executable files.

Not surprisingly, things get complicated when you want to profile JIT code, so let's back up a step and start by just looking at the interpreter. If we run that same program with the JIT disabled (-joff), GDB does better:

Program received signal SIGINT, Interrupt.
0x000055555555f05b in lj_BC_CALL ()
(gdb) bt
#0 0x000055555555f05b in lj_BC_CALL ()
#1 0x0000555555573889 in lua_pcall (L=L@entry=0x7ffff7e6c380, nargs=nargs@entry=0, nresults=nresults@entry=-1, errfunc=errfunc@entry=2) at lj_api.c:1121
#2 0x000055555555c8bb in docall (L=L@entry=0x7ffff7e6c380, narg=narg@entry=0, clear=clear@entry=0) at luajit.c:122
#3 0x000055555555db42 in handle_script (argx=<optimized out>, L=0x7ffff7e6c380) at luajit.c:292
#4 pmain (L=0x7ffff7e6c380) at luajit.c:550
#5 0x000055555555f8f6 in lj_BC_FUNCC ()
#6 0x00005555555738e1 in lua_cpcall (L=L@entry=0x7ffff7e6c380, func=func@entry=0x55555555d350 <pmain>, ud=ud@entry=0x0) at lj_api.c:1178
#7 0x000055555555c71e in main (argc=3, argv=0x7fffffffe188) at luajit.c:581

Now we can see that these functions lj_BC_CALL and lj_BC_FUNCC are probably related to the interpreter (BC == bytecode), so maybe we can use them to know when to run our unwinder. So how do we do that? Well, it turns out those aren't functions; they're just labels in the interpreter (i.e. jump targets). We have to know the exact bounds of the interpreter to know when the program is executing Lua. For Ruby/Python it is just a symbol lookup on the interpreter symbol, which is the function that contains the interpreter. So what function contains those labels? Turns out there isn't one. LuaJIT's interpreter isn't a function. Here's what LuaJIT's interpreter source looks like. This is basically a high level assembly code templating system. LuaJIT comes with its own homegrown assembler called "dynasm" to turn these ".dasc" files into a C program that, when run, generates an assembly file that the compiler/linker can turn into an object file and link into your program. LuaJIT eschews the classic interpreter techniques of using computed GOTOs and other compiler tricks to write a fast interpreter in C, and takes C out of the equation entirely. Of course the penalty is that each architecture needs its own VM implementation. Here's what one bytecode instruction looks like:

case BC_ADDVN: case BC_ADDNV: case BC_ADDVV:
  | ins_arith add, addsd
  break;

ins_arith is a generic arithmetic template that expands into a prologue (bytecode decoding, stack-to-register motion, type checking), an arithmetic instruction (addsd), and an epilogue (store result, go to next instruction). The instruction is a 32-bit value that has 4 parts, OP (ADDVN), A, B, and C, and it means do A = B + C, where A and B are stack slots and C is a constant offset. In the end you get this (for amd64, with some helpful commentary):

luajit`lj_BC_ADDVN:
luajit[0xa28d] <+0>: movzbl %ah, %ebp ; move B stack offset to ebp
luajit[0xa290] <+3>: movzbl %al, %eax ; move C constant offset to eax
luajit[0xa293] <+6>: movq (%rdx,%rbp,8), %r11 ; move B stack value to r11
luajit[0xa297] <+10>: sarq $0x2f, %r11 ; shift right 47 bits to expose type
luajit[0xa29b] <+14>: cmpl $-0xe, %r11d ; compare type to NUM type
luajit[0xa29f] <+18>: jae 0xc052 ; lj_vmeta_arith_vn ; bail on type check fail
luajit[0xa2a5] <+24>: movsd (%rdx,%rbp,8), %xmm0 ; move stack B value to fp reg
luajit[0xa2aa] <+29>: addsd (%r15,%rax,8), %xmm0 ; add constant C to variable
luajit[0xa2b0] <+35>: movsd %xmm0, (%rdx,%rcx,8) ; move result to stack slot A
luajit[0xa2b5] <+40>: movl (%rbx), %eax ; load next instruction to eax
luajit[0xa2b7] <+42>: movzbl %ah, %ecx ; load next A to ecx
luajit[0xa2ba] <+45>: movzbl %al, %ebp ; load OP to ebp
luajit[0xa2bd] <+48>: addq $0x4, %rbx ; advance PC 4
luajit[0xa2c1] <+52>: shrl $0x10, %eax ; shift 16 so eax has two high bytes
luajit[0xa2c4] <+55>: jmpq *(%r14,%rbp,8) ; jump to address of next instruction
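The field extraction in the listing above (movzbl for the byte-sized operands, shrl for the upper halfword) is just unpacking the instruction word's byte layout. A quick sketch in Python, purely for illustration (not the profiler's code):

```python
# Sketch of LuaJIT's 32-bit ABC bytecode layout: | B | C | A | OP |,
# with OP in the low byte, matching the movzbl/shrl unpacking in the
# interpreter assembly above.

def encode_abc(op, a, b, c):
    """Pack operator and operands into one 32-bit instruction word."""
    return op | (a << 8) | (c << 16) | (b << 24)

def decode_abc(ins):
    """Unpack (op, a, b, c), the way the interpreter's prologue does."""
    op = ins & 0xFF          # movzbl %al -> operator, selects the handler
    a = (ins >> 8) & 0xFF    # movzbl %ah -> A operand (stack slot)
    hi = ins >> 16           # shrl $0x10 -> leaves B and C in the top half
    c = hi & 0xFF
    b = (hi >> 8) & 0xFF
    return op, a, b, c
```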

Astute readers will wonder: what's with the 47 bits thing, and why are we doing floating point math? Lua stack values are 64-bit NaN-encoded floating point numbers; if the first 13 bits of a double are all 1, it's a NaN. ASCII art to the rescue:

** ------MSW------.------LSW------
** primitive types |1..1|itype|1..................1|
** GC objects |1..1|itype|-------GCRef--------|
** number ------------double-------------

So if you try to add a primitive (false, true, nil, etc.) or an object reference to a number it will fail (yay, type safety! shakes fist at Mr. Eich). Lua doesn't have raw integers, because floating point units are pretty fast these days and with doubles you get 52 bits of mantissa, which can represent a lot of integers accurately. Who needs 64-bit integers anyways (pro-tip: don't write crypto code in Lua)? Fine for a scripting engine. Correction: LuaJIT does support 64-bit integers using the "LL" type suffix; this puts full 64-bit values behind a boxed pointer.
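The type check from the listing above (sarq $0x2f, then an unsigned compare) can be mimicked in a few lines. This is just an illustration of the NaN-tagging scheme; the -0xe number-type threshold comes from the disassembly above and can vary by build:

```python
import struct

def slot_tag(bits):
    """Arithmetic shift right by 47, like `sarq $0x2f, %r11`."""
    if bits & (1 << 63):          # emulate 64-bit sign propagation in Python
        bits -= 1 << 64
    return bits >> 47

def is_number(bits):
    """`cmpl $-0xe, %r11d; jae bail`: unsigned compare on the low 32 bits."""
    return (slot_tag(bits) & 0xFFFFFFFF) < (-0xE & 0xFFFFFFFF)

# An ordinary double never has its top 13 bits all set, so it passes the
# check; a primitive like nil (all ones here) fails and takes the slow path.
double_bits = struct.unpack("<Q", struct.pack("<d", 3.25))[0]
nil_bits = 0xFFFFFFFFFFFFFFFF
```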

So to summarize: we move a value from the stack into a register, move a value from a constant pool to a register, type check the stack value, add them with the addsd instruction, move the result back to the Lua stack, and jump to the next instruction. Getting this level of interpreter optimization from the C language is very difficult (impossible?). 15 instructions for one bytecode is impressive! For comparison, the classic switch dispatch interpreter of plain Lua, written in C, uses about 25 instructions for ADD (not a perfect comparison as they don't use the exact same bytecode format, but close enough). For reference, a simple recursive Fibonacci calculator runs 2x faster in LuaJIT (w/ JIT off) vs plain Lua.

Anyways, all these operator implementations are packed together in one contiguous blob by the assembler, and the addresses of each instruction are stored in a DISPATCH table that hangs off a global. When we want to go to the next instruction we index the DISPATCH table by the next OP code and jump there. This is called a direct threaded interpreter: no need to switch on the OP code. A full list of byte codes can be found here.
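Direct threading is easy to picture in miniature. Here's a toy dispatch loop, with Python function references standing in for the machine code addresses in LuaJIT's DISPATCH table (the opcodes here are made up for the sketch):

```python
# Toy direct-threaded-style dispatcher: handlers live in a table indexed
# by opcode, and "go to the next instruction" is a table lookup plus a
# jump, never a switch over opcodes.

OP_PUSH, OP_ADD, OP_HALT = 0, 1, 2

def run(code):
    stack = []
    pc = 0

    def op_push(arg):
        stack.append(arg)

    def op_add(_):
        b, a = stack.pop(), stack.pop()
        stack.append(a + b)

    dispatch = [op_push, op_add, None]  # the DISPATCH table analogue
    while True:
        op, arg = code[pc]              # fetch, like `movl (%rbx), %eax`
        pc += 1                         # advance, like `addq $0x4, %rbx`
        if op == OP_HALT:
            return stack[-1]
        dispatch[op](arg)               # like `jmpq *(%r14,%rbp,8)`

result = run([(OP_PUSH, 2), (OP_PUSH, 3), (OP_ADD, 0), (OP_HALT, 0)])
```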

So the unwinder works by registering spans of addresses and saying: use unwinder X for this particular span. Again, for Python this is easy:

interpRanges, err := info.GetSymbolAsRanges("_PyEval_EvalFrameDefault")

For LuaJIT there is a C variable for this code blob, called "lj_vm_asm_begin", but it doesn't help us because it's just the start address, and we can't look up that symbol if the library is stripped, which it frequently is. To get the extents there's really no direct way. Luckily, if we examine the .eh_frame entries for LuaJIT we can find it, as these aren't stripped. One of these things is not like the others (this is dwarfdump -F output):

< 12><0x00009e60:0x0000def5><><cie offset 0x00000020::cie index 1><fde offset 0x00000074 length: 0x00000018>
<eh aug data len 0x0>
0x00009e60: <off cfa=80(r7) > <off r3=-24(cfa) > <off r6=-16(cfa) > <off r14=-40(cfa) > <off r15=-32(cfa) > <off r16=-8(cfa) >

This is the information GDB used to walk the stack above, so it's a pretty safe bet it's accurate. Breaking out the calculator: 0xdef5 - 0x9e60, that's a 16k function covered by a single DWARF FDE (frame description entry), orders of magnitude larger than any other function. We can gain more confidence that this is the interpreter by noticing that the 0x80 offset for the stack pointer (r7) is CFRAME_SIZE. That's still a little loosey-goosey for production code, so we take the added measure of looking for the lj_vm_asm_begin address and cross-checking against that when debug information hasn't been stripped (and we run unit tests against every version we can get our hands on to make sure they all work).
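The heuristic boils down to "find the absurdly large FDE". A sketch of that idea; the interpreter range below is taken from the dwarfdump output above, the other ranges and the thresholds are made up for the example:

```python
# Pick the one FDE range that is orders of magnitude larger than the rest;
# that blob is the hand-written interpreter.

def find_interp_range(fde_ranges, min_size=0x2000, factor=10):
    sizes = sorted((end - start, start, end) for start, end in fde_ranges)
    size, start, end = sizes[-1]
    runner_up = sizes[-2][0] if len(sizes) > 1 else 0
    if size >= min_size and size > factor * runner_up:
        return start, end
    return None  # no clear outlier; fall back to symbol lookup

fdes = [(0x1000, 0x1080), (0x2000, 0x2140), (0x9E60, 0xDEF5)]
```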

Cool, so now if our profiler hits a stack frame with a PC in that range we know we're executing the interpreter!

Now what about those pesky JIT frames? Here we need to help the unwinder out by manually adjusting the stack pointer. Turns out by default most traces have a fixed stack frame size, but some traces need more space, and this is stored in the trace metadata. So to find the stack frame size we add the trace's stack adjustment ("spadjust") value to the CFRAME_SIZE_JIT constant.
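As arithmetic that's one addition per frame. A sketch, with a made-up CFRAME_SIZE_JIT value (the real constant depends on the architecture and the LuaJIT build, and "spadjust" comes out of the trace metadata):

```python
CFRAME_SIZE_JIT = 0x60  # hypothetical value, for illustration only

def sp_past_jit_frame(sp, spadjust):
    """Advance the stack pointer over a JIT trace frame to keep unwinding."""
    return sp + CFRAME_SIZE_JIT + spadjust
```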

Okay, so now we know how to jump over a JIT frame and continue unwinding on the other side, but how do we register the addresses for the machine code? We take the low tech approach of just looking at the proc maps and registering the entire virtual address span of any anonymous, executable region. This is a half measure though: when we get a hit we don't know a priori what the stack adjustment is, so we guess it's 0 and make an attempt to unwind. If after unwinding the Lua frames we fail to walk the rest of the C stack, we'll report a truncated stack. But if we successfully get our context we'll report that back to user land, and then in user land we poke around using remote memory accesses to record each trace's exact machine code boundaries and stack adjustment. More on this later; before we get into how that works we need some context.

Context is everything!

So how do we, a poor little eBPF program, find our context? For other interpreters we rely on the fact that there can only be one, and usually read the VM context pointer out of a global (Ruby) or a thread local (Python). Well, that's no fun: what if there are multiple instances of LuaJIT in the same process (i.e. separate LuaJIT instances per application or per user)? That happens in practice, and furthermore LuaJIT doesn't put anything in any public global or thread local variables we can bank on. Remember, LuaJIT is meant to be embedded; it doesn't make assumptions about the process like other languages that assume they are the only show in town. Anyways, every Lua instance has some global state (heap allocated, it doesn't exist until "lua_open" is called), usually referred to as "G". There's also a runtime or execution context, called "L", that roughly means "thread". Multiple Ls are needed to support coroutines (another cool Lua feature), but there's no built-in concurrency support for real threading. If we can find "L" we can get "G", and if we can get "G" we can get the currently executing "L", so either one will do.

So how do we get the G or L? Always important to explore the solution space! One approach, used by the APISIX profiler, is to put uprobes on the lua_call/lua_resume functions and record "L" before we get to the interpreter. But uprobes come with their own baggage and overhead we don't want to incur. In practice they work pretty well, but customers are gun-shy about uprobes and we'd rather only rely on them if there were no other choice. But this is software, there are always other choices! For interpreter frames this is easy: there's an L pointer that's always (well, extremely frequently; there are always edge cases with arbitrary interrupt sampling, i.e. function prologues/epilogues) at a constant offset from the stack pointer.

Okay, easy enough. What about the JIT? This is harder because the JIT doesn't stash the L pointer on the stack; instead we have to poke around in the generated JIT code and see where it gets the context when it needs it. Turns out lj_gc_step_jit, a frequently called internal function in JIT traces, does just this. It looks like this:

64c3c112fe8c lea rdi, [r14-0xfa8]
64c3c112fe93 call 0x64c3b1b634d0 ->lj_gc_step_jit

In a JIT frame, r14 holds the dispatch table, which is the direct threaded jump table containing the addresses of the bytecode operator implementations. It's a "global" and can be used to find other global things at a constant offset, which suits our purposes nicely. So how do we know that the offset is 0xfa8? Well, it isn't always; it varies depending on LuaJIT version, platform, etc. In order to find this offset it would be nice to just look up symbols and calculate the offset via symbol math. LuaJIT's global state looks like this:

struct GG_State {
  lua_State L;
  global_State g;
  jit_State J;
  HotCount hotcount[64];
  ASMFunction dispatch[243];
  BCIns bcff[57];
};

For the version of LuaJIT above on amd64, 0xfa8 is the distance from the dispatch table back to the global_State "g". So if we had symbols we could just do something like:

(gdb) p/x (char*)&((GG_State*)0)->dispatch - (char*)&((GG_State*)0)->g
$3 = 0xfa8

But we don't always have debug information. Back to the drawing board. What if we could analyze the LuaJIT code of a public function that we know does this g-to-dispatch math? The idea is to walk the instructions until we find the math we're looking for. But walking instructions is dicey: compilers change, compiler switches change, inlining decisions change, so we want to find the simplest case possible. Running 'nm -UD' on libluajit's shared library shows we have 152 candidates! By brute force and ignorance we stumbled upon "luaopen_jit", which, as the first thing it does, calls a function called "jit_init", which does very little except call a function called "lj_dispatch_update", which has this line:

ASMFunction *disp = G2GG(g)->dispatch;

So: given a global_State pointer "g", derive the GG_State address and take its "dispatch" table. Looks promising, what does the assembly look like?

libluajit-5.1.so[0x16d4e] <+94>: leaq 0xfa8(%rdx), %r10

So can we rip through the instructions of "lj_dispatch_update" and find this lea (load effective address) instruction? This is at offset 94, so we don't have a ton of lead-up instructions to throw us off. That's important for two reasons: it limits the chances the compiler will do something tricky (like stash rdx on the stack and read it back later), and it avoids issues with Golang's incomplete x86 disassembler (to be fair, x86 is a complicated beast). Here's the function we're dealing with:

void lj_dispatch_update(global_State *g)

So we know "g" will be in the rdi register (where the first argument goes in the Linux amd64 calling convention), but because compilers do what compilers do, it's in the rdx register by the time we get to our lea. So can we track where "g" goes and look for a lea instruction? It turns out that in all versions of LuaJIT, finding the first lea of a g-relative address works to find this offset. The code looks like this. Hot bananas! Did you read it? Pretty simple actually; so nice of the Go folks to include a disassembler, batteries included indeed.
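To give a flavor of what that scan does (the real code uses Go's x86 disassembler and tracks which register holds "g"), here's a deliberately simplified sketch that only recognizes the one REX + 0x8D + ModRM(mod=10) + disp32 encoding seen in the listing above:

```python
import struct

def find_lea_disp32(code):
    """Scan for the first LEA with a 32-bit displacement and return it."""
    i = 0
    while i + 7 <= len(code):
        rex, op, modrm = code[i], code[i + 1], code[i + 2]
        # REX prefix, LEA opcode, ModRM with mod=10 (register + disp32)
        if 0x40 <= rex <= 0x4F and op == 0x8D and (modrm >> 6) == 0b10:
            return struct.unpack("<i", code[i + 3:i + 7])[0]
        i += 1  # naive byte-at-a-time scan, good enough for a sketch
    return None

# push %rbp; mov %rdi,%rdx; leaq 0xfa8(%rdx),%r10
prologue = bytes([0x55, 0x48, 0x89, 0xFA, 0x4C, 0x8D, 0x92, 0xA8, 0x0F, 0x00, 0x00])
```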

It turns out that to walk Lua completely we only need to find 2 offsets in this manner: the "g" to "dispatch" offset, and the "g" to "traces" offset. The latter is a field in the jit_State we need in order to find all the LuaJIT traces, so we can look up their machine addresses and "spadjust" values to calculate frame sizes.

The only piece left is: how do we walk the Lua frames?

Lua has its own stack, reachable from the Lua "L" pointer. The start of the stack is stored in "stack" and the top frame (i.e. the currently executing function) is found via "base". Base is the start of the stack for the top frame, so base + 0 is stack slot zero. The frame information is stored just before that. It looks like this:

** base-2 base-1 | base base+1 ...
** [func PC/delta/ft] | [slots ...]
** ^-- frame | ^-- base ^-- top

So at base-1 there's a PC return address (i.e. the address of the next instruction after the call instruction), and at base-2 the function that was called. So to "walk" the stack we need two additional bits of information: what is the PC of the "top" function, and what is the "delta" between frames? The PC for the top frame is stored in the interpreter's native stack frame at a constant offset, so that's easy to get (and conveniently it is also maintained by the JIT), and the delta between two frames is stored in the CALL instruction, which can be found by looking at the instruction at PC - 1. Of course there are lots of "details", like the fact that a Lua call frame can be a native function call (i.e. an FFI call), and things are a little different in those cases, but we won't go into that here.
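Schematically the walk looks like this. Everything here is simplified for illustration: the stack is a Python list, the frame word is treated as a plain slot-count delta, and the real code decodes tagged PC/delta/ft words and handles the FFI cases:

```python
def walk_frames(stack, base):
    """Collect the function of each frame, walking from the top frame down."""
    frames = []
    while base > 0:
        frames.append(stack[base - 2])  # [func PC/delta/ft] | [slots...]
        delta = stack[base - 1]         # sketch: pretend the word IS the delta
        base -= delta                   # hop to the caller's frame
    return frames

# hypothetical stack with two frames: "inner" was called by "main"
lua_stack = ["main", 2, "x", "inner", 3, "y"]
```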

So, for an eBPF program that's trying to do as little work as possible and send the minimal amount of information back to user land, what do we actually need to do? I mean, we could just snapshot the entire Lua stack, send that over, and have user land process it. But because of LuaJIT's FFI support we have to allow for Lua calling out to native code, and because Lua's FFI supports a "callback" mode, where a Lua function can be wrapped in a C function call, we have to also support arbitrary C frames calling back into Lua. Here's an example of Lua's FFI capabilities:

local ffi = require "ffi"
ffi.cdef[[
long random();
void qsort(void *base, size_t nel, size_t width, int (*compar)(const long *, const long *));
]]
function compare(a, b)
  return a[0] - b[0]
end
local callback = ffi.cast("int (*)(const long *, const long *)", compare)
local n = 20
local arr = ffi.new("long[?]", n)
for i = 0, n - 1 do
  arr[i] = ffi.C.random()
end
ffi.C.qsort(arr, n, ffi.sizeof("long"), callback)
for i = 0, n - 1 do
  print(arr[i])
end

By the time the "compare" function is called the stack frame is a tangled weave of different types of calls:

frame #5: 0x000055555555f8f6 luajit`lj_BC_FUNCC + 68
frame #6: 0x00007ffff7c49940 libc.so.6`msort_with_tmp(p=0x00007fffffffd790, b=0x00007ffff7e7a928, n=2) at qsort.c:239:8
frame #7: 0x00007ffff7c49715 libc.so.6`msort_with_tmp [inlined] msort_with_tmp(n=2, b=0x00007ffff7e7a928, p=0x00007fffffffd790) at qsort.c:201:6
frame #8: 0x00007ffff7c49707 libc.so.6`msort_with_tmp(p=0x00007fffffffd790, b=0x00007ffff7e7a928, n=5) at qsort.c:209:3
frame #9: 0x00007ffff7c49715 libc.so.6`msort_with_tmp [inlined] msort_with_tmp(n=5, b=0x00007ffff7e7a928, p=0x00007fffffffd790) at qsort.c:201:6
frame #10: 0x00007ffff7c49707 libc.so.6`msort_with_tmp(p=0x00007fffffffd790, b=0x00007ffff7e7a928, n=10) at qsort.c:209:3
frame #11: 0x00007ffff7c49715 libc.so.6`msort_with_tmp [inlined] msort_with_tmp(n=10, b=0x00007ffff7e7a928, p=0x00007fffffffd790) at qsort.c:201:6
frame #12: 0x00007ffff7c49707 libc.so.6`msort_with_tmp(p=0x00007fffffffd790, b=0x00007ffff7e7a928, n=20) at qsort.c:209:3
frame #13: 0x00007ffff7c49c0d libc.so.6`__GI___qsort_r [inlined] msort_with_tmp(n=20, b=0x00007ffff7e7a928, p=0x00007fffffffd790) at qsort.c:201:3
frame #14: 0x00007ffff7c49c08 libc.so.6`__GI___qsort_r(pbase=0x00007ffff7e7a928, total_elems=20, size=8, cmp=<unavailable>, arg=<unavailable>) at qsort.c:393:7
frame #15: 0x0000555555561d62 luajit`lj_vm_ffi_call + 132
frame #16: 0x00005555555a7f45 luajit`lj_ccall_func(L=<unavailable>, cd=<unavailable>) at lj_ccall.c:1402:5
frame #17: 0x0000555555588dfd luajit`lj_cf_ffi_meta___call(L=0x00007ffff7e6c380) at lib_ffi.c:230:15
frame #18: 0x000055555555f8f6 luajit`lj_BC_FUNCC + 68
frame #19: 0x0000555555573889 luajit`lua_pcall(L=0x00007ffff7e6c380, nargs=0, nresults=-1, errfunc=2) at lj_api.c:1150:12
frame #20: 0x000055555555c8bb luajit`docall(L=0x00007ffff7e6c380, narg=0, clear=0) at luajit.c:122:12
frame #21: 0x000055555555db42 luajit`pmain [inlined] handle_script(argx=<unavailable>, L=0x00007ffff7e6c380) at luajit.c:292:14
frame #22: 0x000055555555dac6 luajit`pmain(L=0x00007ffff7e6c380) at luajit.c:550:17
frame #23: 0x000055555555f8f6 luajit`lj_BC_FUNCC + 68
frame #24: 0x00005555555738e1 luajit`lua_cpcall(L=0x00007ffff7e6c380, func=0x0000000000009350, ud=0x0000000000000000) at lj_api.c:1178:12
frame #25: 0x000055555555c71e luajit`main(argc=2, argv=0x00007fffffffe178) at luajit.c:581:12
frame #26: 0x00007ffff7c2a1ca libc.so.6`__libc_start_call_main(main=(luajit`main at luajit.c:570:1), argc=2, argv=0x00007fffffffe178) at libc_start_call_main.h:58:16
frame #27: 0x00007ffff7c2a28b libc.so.6`__libc_start_main_impl(main=(luajit`main at luajit.c:570:1), argc=2, argv=0x00007fffffffe178, init=<unavailable>, fini=<unavailable>, rtld_fini=<unavailable>, stack_end=0x00007fffffffe168) at libc-start.c:360:3
frame #28: 0x000055555555c795 luajit`_start + 37

So our unwinder needs to know when these special frames are called and jump back to the native unwinder at the right times. As if that weren't enough, LuaJIT also JITs the code that implements these callbacks. In order to call a Lua function from a C function all the arguments have to be converted: C strings into Lua strings, native ints into Lua's NaN-boxed doubles, and so on. You can start to see why LuaJIT is so popular for embedding; it makes calling back and forth between the host and Lua extremely easy and efficient.

So back to what to record: the choices are to record the raw GCfunc pointers we find on the stack, or to unpack them to their GCproto, which is where the meat of the function is stored (bytecode, line number info). In order to keep the userland unwind code as decoupled from LuaJIT internals as possible, we choose the latter.

Okay, so we know how to find the LuaJIT context pointers, walk across the LuaJIT JIT frames, and walk the Lua language stack. How do we put it all together into a profiler? First we need to discuss how we get from registering the entire anonymous executable region to knowing the address of each trace and its specific stack delta adjustment.

Normally PC lookups in unwinders are done via a binary search on a sorted list of start addresses. This works well, although typically, to avoid too many loop iterations and hitting eBPF instruction count limits, they need to be two-level lookups. But there's another option that works particularly well for our use case: longest prefix match (LPM) tries. Frequently used for IP addresses (internet protocol, not instruction pointer), an LPM trie lets us register overlapping address ranges, which doesn't work with the binary search approach. Luckily eBPF has LPM tries built in. The eBPF map lets us attach an arbitrary value to each entry, so we can stuff the stack adjustment in there. Furthermore, to circumvent that pesky problem of relying on the DISPATCH register (remember r14), we can stuff the "g" context pointer in there as well (machine code isn't shared across LuaJIT instances). I glossed over it earlier, but the sampler may fire when the JIT has called some internal helper (like the GC step helper we saw earlier), at which point we can't rely on r14 holding the DISPATCH pointer. By passing "g" in the LPM trie value we can stop assuming that r14 is the dispatch address and that the stack adjustment is zero. This matters because, unlike traditional JITs, tracing JITs are continuously creating new traces and deleting unused ones. It's also possible for LuaJIT to map new virtual address regions for additional traces, and to expand existing regions (i.e. "realloc"). Accounting for all this gets a little tricky; no problem, the best way to deal with tricky bits like that is to throw some unit tests at them.

How do we deal with trace garbage collection? Our unwinder can tell whether it was triggered by a PC falling in the anonymous executable VM range as a whole or by the machine code span of a specific trace. We use the former as a trigger to send an event to user land, which re-syncs our mappings. Once we're synced up, the event won't fire again until new traces come into existence.

The final detail to clear up is how we get actual function names, file names and line numbers. You'd think this would be easy, but because Lua functions are first-class anonymous values there's a bit of work involved. A function in Lua is an object with no inherent name; in bytecode, defining one looks like this:

0001 FNEW 0 0 ; func.lua:1
0002 GSET 0 1 ; "func"

So in order to report stack frames with function names we have to do a bit of guesswork. Essentially we look at call sites and find the "name" used to look up the function:

0003 GGET 0 1 ; "func"
0004 CALL 0 1 1

Okay, so that's the theory; how do we deal with it in practice? Each frame must carry these minimal pieces of information to be "symbolized":

  • The PC of the currently executing frame.
  • The offset of that PC from the start of the function's bytecode. We need this because the bytecode is co-allocated with the GCproto, meaning the bytecode starts at the end of the GCproto struct. So PC - Offset_From_Start = GCproto* + 1 in C pointer math speak, and if we have the offset and we know the size of GCproto we can calculate the GCproto address.
  • The PC of the calling function, so that we can walk the bytecode and gin up a name for the callee.
  • The offset of that PC from the start of the caller's bytecode, so we can get the caller's GCproto.

So to summarize, the answer to the question "how do you write an unwinder for LuaJIT?" (or at least the Polar Signals answer):

  • Register hooks on interpreter code boundaries and any anonymous executable regions so they go to our unwinder.
  • Get context pointers from the stack for interpreter frames, from the DISPATCH register for freshly seen JIT frames, and from the LPM entry for "known" traces.
  • Walk the Lua stack, bailing to the native unwinder if we see any FFI callback frames, and report the PC and GCproto address for each frame back to userland.
  • In userland, convert each frame to a function name, file name and line number using GCproto fields, walking the bytecode to derive the function name.
  • Periodically, and when signaled by the unwinder, revisit LuaJIT's executable anonymous virtual regions and trace definitions to make sure the LPM map entries are up to date.

It's that easy! The AI agents didn't do a very good job of helping figure any of this out, so maybe the AI prognosticators who theorize that programmers will all have to become farmers soon are wrong! Darn. Honestly though, this is a cleaned-up "revisionist history" that mostly only covers the final design. In reality, weeks were spent single stepping (forwards and backwards, thank you RR!) through the LuaJIT engine trying to piece together what was going on, and coming up with solutions that didn't work and throwing them out. Oh well, history is written by the victors!

So what does it look like in practice? Check it out below! There are still some rough edges to sort out, but it mostly works.

Swing over to Parca central if you want to give it a whirl and hop on to our Discord if you need any help!
