Further adventures in the JIT
This post continues our adventures in the JIT, digging a bit deeper into the implementation details.
While writing things in machine code (assembler) gives you great freedom it comes at the cost of having to invent almost everything yourself, and there’s no clever compiler to help you catch mistakes. For example, if you call a function in a certain manner and said function doesn’t expect that, you’ll crash your OS process at best or spend hours chasing a heisenbug at worst.
Hence, conventions are always front and center when writing assembler, so we need to visit some of the ones we’ve chosen before moving on.
The most important one concerns registers, and we base it on the system calling convention to make it easier to call C code. I’ve included tables for the SystemV convention used on Linux below. The registers differ on other systems like Windows, but the principle is the same on all of them.
| Register | Name | Callee save | Purpose | 
|---|---|---|---|
| RDI | ARG1 | no | First argument | 
| RSI | ARG2 | no | |
| RDX | ARG3 | no | |
| RCX | ARG4 | no | |
| R8 | ARG5 | no | |
| R9 | ARG6 | no | Sixth argument | 
| RAX | RET | no | Function return value | 
Thus, if we want to call a C function with two arguments, we move the first
into ARG1 and the second into ARG2 before calling it, and we’ll have the
result in RET when it returns.
Beyond saying which registers are used to pass arguments, calling conventions also say which registers retain their value over function calls. These are called “callee save” registers, since the called function needs to save and restore them if they’re modified.
In these registers, we keep commonly-used data that rarely (if ever) changes in C code, helping us avoid saving and restoring them whenever we call C code:
| Register | Name | Callee save | Purpose | 
|---|---|---|---|
| RBP | active_code_ix | yes | Active code index | 
| R13 | c_p | yes | Current process | 
| R15 | HTOP | yes | Top of the current process’ heap | 
| R14 | FCALLS | yes | Reduction counter | 
| RBX | registers | yes | BEAM register structure | 
We also keep the current process’ stack in RSP, the machine stack pointer,
to allow call and ret instructions in Erlang code.
The downside of this is that we can no longer call arbitrary C code as it may assume a much larger stack, requiring us to swap back and forth between a “C stack” and the “Erlang stack”.
In my previous post we called a C function (timeout) without doing any of
this, which was a bit of a white lie. It used to be done that way before we
changed how the stack works, but it’s still pretty simple as you can see below:
void BeamModuleAssembler::emit_timeout() {
    /* Swap to the C stack. */
    emit_enter_runtime();
    /* Call the `timeout` C function.
     *
     * runtime_call compiles down to a single `call`
     * instruction in optimized builds, and has a few
     * assertions in debug builds to prevent mistakes
     * like forgetting to switch stacks. */
    a.mov(ARG1, c_p);
    runtime_call<1>(timeout);
    /* Swap back to the Erlang stack. */
    emit_leave_runtime();
}
Swapping the stack is very cheap because of a trick we use when setting up
registers: by allocating the structure on the C stack we can compute the
address of said stack from registers, which avoids having to reserve a
precious callee save register and is much faster than having it saved in memory
somewhere.
With the conventions out of the way we can start looking at code again. Let’s
pick a larger instruction this time, test_heap, which allocates heap
memory:
void BeamModuleAssembler::emit_test_heap(const ArgVal &Needed,
                                         const ArgVal &Live) {
    const int words_needed = (Needed.getValue() + S_RESERVED);
    Label after_gc_check = a.newLabel();
    /* Do we have enough free space already? */
    a.lea(ARG2, x86::qword_ptr(HTOP, words_needed * sizeof(Eterm)));
    a.cmp(ARG2, E);
    a.jbe(after_gc_check);
    /* No, we need to GC.
     *
     * Switch to the C stack, and update the process
     * structure with our current stack (E) and heap
     * (HTOP) pointers so the C code can use them. */
    emit_enter_runtime<Update::eStack | Update::eHeap>();
    /* Call the GC, passing how many words we need and
     * how many X registers we use. */
    a.mov(ARG2, imm(words_needed));
    a.mov(ARG4, imm(Live.getValue()));
    a.mov(ARG1, c_p);
    load_x_reg_array(ARG3);
    a.mov(ARG5, FCALLS);
    runtime_call<5>(erts_garbage_collect_nobump);
    a.sub(FCALLS, RET);
    /* Swap back to the Erlang stack, reading the new
     * values for E and HTOP from the process structure. */
    emit_leave_runtime<Update::eStack | Update::eHeap>();
    a.bind(after_gc_check);
}
While this isn’t too complicated it’s still a rather large amount of code: since all instructions are emitted directly into their modules, small inefficiencies like this tend to bloat the modules rather quickly. Beyond using more RAM, this wastes precious instruction cache so we’ve spent a lot of time and effort on reducing code size.
Our most common method of reducing code size is to break out as much of the instruction as possible into a globally shared part. Let’s see how we can apply this technique:
void BeamModuleAssembler::emit_test_heap(const ArgVal &Needed,
                                         const ArgVal &Live) {
    const int words_needed = (Needed.getValue() + S_RESERVED);
    Label after_gc_check = a.newLabel();
    a.lea(ARG2, x86::qword_ptr(HTOP, words_needed * sizeof(Eterm)));
    a.cmp(ARG2, E);
    a.jbe(after_gc_check);
    a.mov(ARG4, imm(Live.getValue()));
    /* Call the global "garbage collect" fragment. */
    fragment_call(ga->get_garbage_collect());
    a.bind(after_gc_check);
}
/* This is the global part of the instruction. Since we
 * know it will only be called from the module code above,
 * we're free to assume that ARG4 is the number of live
 * registers and that ARG2 is (HTOP + bytes needed). */
void BeamGlobalAssembler::emit_garbage_collect() {
    /* Convert ARG2 to "words needed" by subtracting
     * HTOP and dividing it by 8.
     *
     * This saves us from having to explicitly pass
     * "words needed" in the module code above. */
    a.sub(ARG2, HTOP);
    a.shr(ARG2, imm(3));
    emit_enter_runtime<Update::eStack | Update::eHeap>();
    /* ARG2 and ARG4 have already been set earlier. */
    a.mov(ARG1, c_p);
    load_x_reg_array(ARG3);
    a.mov(ARG5, FCALLS);
    runtime_call<5>(erts_garbage_collect_nobump);
    a.sub(FCALLS, RET);
    emit_leave_runtime<Update::eStack | Update::eHeap>();
    a.ret();
}
While we had to write about as much code, the part that is copied into the module is significantly smaller.
In our next post we’ll take a break from implementation details and look at the history behind this JIT.