You can probably eliminate the switch bounds check by making the default case execute __builtin_unreachable(). The switch version is safe for non-conforming opcodes where the computed goto version invokes UB.

I understood up to the comparison operations in this article, but I'm having trouble understanding branch prediction even after reading it. It seems like they used branch prediction optimization... Sometimes when I read articles like this, I start to question whether I can really call myself a programmer

All well and good provided your opcodes are sequential/dense

> Therefore, the standard forces the compiler to generate "safe" code for the switch. Safety, as usual, has cost, so the switch version ends up doing a bit more per loop iteration.

Safety only has a cost in this case because the switch is fundamentally just operating on an integer. With an actual enumerated type (rather than C's primitive "enums as numeric aliases"), which even a basic type system could trivially enforce, there would be no need for this check, because the domain of the value would be guaranteed at compile-time.

Using __builtin_unreachable in this way makes the switch version unsafe, too.

In short modern CPUs can’t actually execute control flow very fast, so they usually cheat and memorize where the program is likely to go next. This works great with things like the loop branch, since most loops are pretty long the cpu can start on the next iteration of the loop while it waits for the condition to finish being calculated. You incur one branch misprediction at the end, but the rest of the loop is fast. But with a bytecode vm, the action of the code is entirely dependent on the bytecode instructions, so it’s not predictable unless those instructions themselves are predictable. That’s why the switch statement is slow. Explaining the optimization from computed gotos is more complicated.

As I understand it, the branch predictor stores its predictions using the address of the branch instruction as a key. In the switch implementation the unpredictable branch is the one that jumps to the specific case. The compiler emits an indirect branch for this just like the computed goto instruction. But it only emits one indirect branch instruction, so that branch is always unpredictable. In the computed goto version, there is no loop, each case statement ends in its own indirect branch, so each bytecode implementation is followed by a branch with a unique address. In practice this makes each of them slightly more predictable, because you only have to predict what comes after an inc instruction instead of what comes next after any instruction. Basically there are more branches to hang predictions on.

In a CPU, there are 2 kinds of branch predictors, predictors for conditional branches, which must predict only whether the branch is taken or not taken, and predictors for indirect branch instructions, which must predict the address of the next instruction that will be executed after the indirect branch instruction.

The indirect branch predictor stores information about indirect branches in a structure that is similar to a cache memory.

When the CPU loads some data from the main memory, the loaded data together with the address from where it was loaded is stored in the cache memory.

When another load instruction is executed later, the load address is used to query the cache memory and if the query is successful the data value is returned by the query and it is used instead of accessing again the main memory.

Similarly, (simplifying a lot) when an indirect branch instruction is executed the address where the execution continues after the indirect branch is stored together with the address of the branch instruction in an associative memory similar to a cache memory.

When an indirect branch instruction is executed later, the address of the branch instruction is used to query the associative memory and if the query is successful it will return the address of the next instruction to be executed and execution will continue there speculatively, until it is confirmed by reading from the main memory that this is the correct jump address. If the jump address is wrong then all the work done is discarded and execution continues at the right address.

This description is simplified, because modern branch descriptors may store for a given branch instruction address not only the last jump address, but a sequence of past jump addresses, together with a pattern about how they have been used (e.g. alternating between jumping twice to the first address and jumping once to the second address). Thus successive queries for the branch instruction address will retrieve different jump addresses, based on their activation pattern.

The point of TFA is that if the dispatch loop contains a single indexed branch instruction, then the associative memory of the indirect branch predictor contains a single record, keyed by the address of that instruction. Depending on the CPU, the record will contain one or more past addresses, but it can predict correctly only a short periodic pattern of jump addresses, i.e. of opcodes used in dispatching.

This could work to predict correctly a short loop in the interpreted program, but typically most of the predictions will be wrong and each branch misprediction takes much more time than the execution of any instruction. In smaller and cheaper CPUs, which might store a single jump address per record, the correct predictions would be extremely rare, as they would happen only if the interpreted program contains repeated instructions.

In the optimized program, the indexed branch instruction is replicated into each "case" so now the associative memory will store separate records, one for each "case", containing the address or addresses of the next cases where execution has continued in the past.

Because after each interpreted instruction the probabilities of the next instructions are not equal, but some instructions are more likely to follow, that greatly increases the chances that the indirect branch predictor will make correct predictions from time to time.

All well and good provided your opcodes are sequential/dense

The same is true for a "switch" statement.

If the "case" values used in a "switch" are sparse, the "switch" will not be compiled into an indexed jump instruction, but into a sequence of conditional jump instructions, which test all the possible cases.

In this situation, the alternative to "switch" for implementing dispatching is no longer a computed "goto", but a multiple "if"/"else" sequence.

A smarter compiler could detect when a "switch" forms the body of a loop and it would replicate the indexed jump instruction at the end of each case, instead of jumping to the beginning of the loop to execute there an indexed jump, or even worse, first jumping to the end of the loop to terminate the "switch", then jumping to the beginning of the loop to repeat the loop body.

With such a compiler, computed "goto" would not be necessary as an alternative to "switch".

The range check inside the dispatch loop would not be necessary if the opcodes had an enumeration type (in a programming language where enumeration types are clearly distinct from integers) and the "switch" handled all the possible cases. In that situation, the range checks would be moved elsewhere in the program, wherever opcodes are generated.

> Therefore, the standard forces the compiler to generate "safe" code for the switch. Safety, as usual, has cost, so the switch version ends up doing a bit more per loop iteration.

FWIW it's possible to get rid of the range check via a __builtin_unreachable in the default case (which of course is essentially injected UB):

https://www.godbolt.org/z/YTcY3E8hx

...it's actually interesting though that the compiler adds the range check on a 'complete' switch over an enum, because the compiler would complain when a case branch is missing (via a warning). E.g. the frontend is smarter than the backend.

The indirect branch predictor stores information about indirect branches in a structure that is similar to a cache memory.

When the CPU loads some data from the main memory, the loaded data together with the address from where it was loaded is stored in the cache memory.

In practice, branch target predictors are also used for conditional branches too, and even unconditional, direct jumps. Yes, the target can (sort of by definition) be computed from the instruction and the pc, but the by using the branch target buffer, you avoid pipeline bubbles in the correctly predicted case by only having a data dependency on the pc, and predicting the target while the instruction is still being fetched. That way on the next cycle you're already fetching from the predicted target while still decoding the branch instruction itself in the next stage(s).

I wonder are there any compilers that recognizes switch-in-loop pattern of interprets and optimizes it to be branch-predictor friendly using multiple indirect jumps?

You really explain things in a way that's very easy to understand. I think I finally get it now. I understood the conditional branch part, but I think I was having trouble with the indirect branch predictor. So basically, when a branch instruction is executed, the CPU learns the actual target address. In the case of a switch statement, because the key is a single value like s, the history records get mixed up, whereas with a computed goto, since there are multiple keys, they don't get mixed up. Thank you for the clear explanation. It's a bit embarrassing that I didn't know this, but thanks to people like you, I feel like I'm learning more.

The same is true for a "switch" statement.

In this situation, the alternative to "switch" for implementing dispatching is no longer a computed "goto", but a multiple "if"/"else" sequence.

With such a compiler, computed "goto" would not be necessary as an alternative to "switch".

The GCC documentation example for computed goto uses an indexing table, but you can of course use computed goto addresses directly. The problem is smuggling the label addresses outside the function so you don't need the enum/index indirection during dispatch.

I once experimented with some techniques and found you can use constructors, either on function-scoped static variables in C++ or using the constructor function attribute on a nested function in GCC C. The addresses are visible to the constructor function and you can smuggle them by copying through a file- or global- scoped array, then at runtime initialize your data structures and bytecode arrays with the computed goto addresses. That preserves the CPUs branch prediction and prefetching resources for other work.

You can also lazily initialize things, which works well if you're not precompiling bytecode but implementing something like a coroutine or state machine where you just need to initialize the first dispatch address.

Even on modern chips avoiding computed goto indexing can be meaningful (5+%) depending on use case, but the marginal gains from the smuggling shenanigans specifically probably aren't worth the code complexity and portability headaches (clang doesn't support nested functions in C), though I never tried that specific trick with a full-blown bytecode VM.

These days with guaranteed tailcall extensions you can just dispatch on function addresses (knowable before dispatch), though you can't directly control inlining optimizations so if you're mostly doing simple work per operation computed gotos might still have better performance. And for small or straight-forward state machines code can be easier to grok when not split across many functions, especially if they're tailcalling each other through pointers.

FWIW it's possible to get rid of the range check via a __builtin_unreachable in the default case (which of course is essentially injected UB):

https://www.godbolt.org/z/YTcY3E8hx

I wonder are there any compilers that recognizes switch-in-loop pattern of interprets and optimizes it to be branch-predictor friendly using multiple indirect jumps?

I believe that Rust does this which implies that it might be common to the clang back end.

Recently, while idly browsing through the source code of Python, I came upon an interesting comment in the bytecode VM implementation (Python/ceval.c) about using the computed gotos extension of GCC [1]. Driven by curiosity, I decided to code a simple example to evaluate the difference between using a computed goto and a traditional switch statement for a simple VM. This post is a summary of my findings.

Defining a simple bytecode VM

First let's make clear what I mean by a "VM" in this context - a Bytecode Interpreter. Simply put, it's a loop that chugs through a sequence of instructions, executing them one by one.

Using Python's 2000-line strong (a bunch of supporting macros not included) PyEval_EvalFrameEx as an example wouldn't be very educational. Therefore, I'll define a tiny VM whose only state is an integer and has a few instructions for manipulating it. While simplistic, the general structure of this VM is very similar to real-world VMs. This VM is so basic that the best way to explain it is just to show its implementation:

#define OP_HALT 0x0 #define OP_INC 0x1 #define OP_DEC 0x2 #define OP_MUL2 0x3 #define OP_DIV2 0x4 #define OP_ADD7 0x5 #define OP_NEG 0x6

int interp_switch(unsigned char* code, int initval) { int pc = 0; int val = initval;

while (1) {
    switch (code\[pc++\]) {
        case OP\_HALT:
            return val;
        case OP\_INC:
            val++;
            break;
        case OP\_DEC:
            val--;
            break;
        case OP\_MUL2:
            val \*= 2;
            break;
        case OP\_DIV2:
            val /= 2;
            break;
        case OP\_ADD7:
            val += 7;
            break;
        case OP\_NEG:
            val = -val;
            break;
        default:
            return val;
    }
}

}

Note that this is perfectly "standard" C. An endless loop goes through the instruction stream and a switch statement chooses what to do based on the instruction opcode. In this example the control is always linear (pc only advances by 1 between instructions), but it would not be hard to extend this with flow-control instructions that modify pc in less trivial ways.

The switch statement should be implemented very efficiently by C compilers - the condition serves as an offset into a lookup table that says where to jump next. However, it turns out that there's a popular GCC extension that allows the compiler to generate even faster code.

Computed gotos

I will cover the details of computed gotos very briefly. For more information, turn to the GCC docs or Google.

Computed gotos is basically a combination of two new features for C. The first is taking addresses of labels into a void*.

void* labeladdr = &&somelabel; somelabel: // code

The second is invoking goto on a variable expression instead of a compile-time-known label, i.e.:

void* table[]; // addresses goto *table[pc];

As we'll see shortly, these two features, when combined, can facilitate an interesting alternative implementation of the main VM loop.

To anyone with a bit of experience with assembly language programming, the computed goto immediately makes sense because it just exposes a common instruction that most modern CPU architectures have - jump through a register (aka. indirect jump).

The simple VM implemented with a computed goto

Here's the same VM, this time implemented using a computed goto [2]:

int interp_cgoto(unsigned char* code, int initval) { /* The indices of labels in the dispatch_table are the relevant opcodes */ static void* dispatch_table[] = { &&do_halt, &&do_inc, &&do_dec, &&do_mul2, &&do_div2, &&do_add7, &&do_neg}; #define DISPATCH() goto *dispatch_table[code[pc++]]

int pc = 0;
int val = initval;

DISPATCH();
while (1) {
    do\_halt:
        return val;
    do\_inc:
        val++;
        DISPATCH();
    do\_dec:
        val--;
        DISPATCH();
    do\_mul2:
        val \*= 2;
        DISPATCH();
    do\_div2:
        val /= 2;
        DISPATCH();
    do\_add7:
        val += 7;
        DISPATCH();
    do\_neg:
        val = -val;
        DISPATCH();
}

}

Benchmarking

I did some simple benchmarking with random opcodes and the goto version is 25% faster than the switch version. This, naturally, depends on the data and so the results can differ for real-world programs.

Comments inside the CPython implementation note that using computed goto made the Python VM 15-20% faster, which is also consistent with other numbers I've seen mentioned online.

Why is it faster?

Further down in the post you'll find two "bonus" sections that contain annotated disassembly of the two functions shown above, compiled at the -O3 optimization level with GCC. It's there for the real low-level buffs among my readers, and as a future reference for myself. Here I aim to explain why the computed goto code is faster at a bit of a higher level, so if you feel there are not enough details, go over the disassembly in the bonus sections.

The computed goto version is faster because of two reasons:

The switch does a bit more per iteration because of bounds checking.
The effects of hardware branch prediction.

Doing less per iteration

If you examine the disassembly of the switch version, you'll see that it does the following per opcode:

Execute the operation itself (i.e. val *= 2 for OP_MUL2)
pc++
Check the contents of code[pc]. If within bounds (<= 6), proceed. Otherwise return from the function.
Jump through the jump table based on offset computed from code[pc].

On the other hand, the computed goto version does this:

Execute the operation itself
pc++
Jump through the jump table based on offset computed from code[pc].

The difference between the two is obviously the "bounds check" step of the switch. Why is it required? You may think that this is because of the default clause, but that isn't true. Even without the default clause, the compiler is forced to generate the bounds check for the switch statement to conform to the C standard. Quoting from C99:

If no converted case constant expression matches and there is no default label, no part of the switch body is executed.

Therefore, the standard forces the compiler to generate "safe" code for the switch. Safety, as usual, has cost, so the switch version ends up doing a bit more per loop iteration.

Branch prediction

Modern CPUs have deep instruction pipelines and go to great lengths ensuring that the pipelines stay as full as possible. One thing that can ruin a pipeline's day is a branch, which is why branch predictors exist. Put simply (read the linked Wikipedia article for more details), it's an algorithm used by the CPU to try to predict in advance whether a branch will be taken or not. Since a CPU can easily pre-fetch instructions from the branch's target, successful prediction can make the pre-fetched instructions valid and there is no need to fully flush the pipeline.

The thing with branch predictors is that they map branches based on their addresses. Since the switch statement has a single "master jump" that dispatches all opcodes, predicting its destination is quite difficult. On the other hand, the computed goto statement is compiled into a separate jump per opcode, so given that instructions often come in pairs it's much easier for the branch predictor to "home in" on the various jumps correctly.

Think about it this way: for each jump, the branch predictor keeps a prediction of where it will jump next. If there's a jump per opcode, this is equivalent to predicting the second opcode in an opcode pair, which actually has some chance of success from time to time. On the other hand, if there's just a single jump, the prediction is shared between all opcodes and they keep stepping on each other's toes with each iteration.

I can't say for sure which one of the two factors weighs more in the speed difference between the switch and the computed goto, but if I had to guess I'd say it's the branch prediction.

What is done in other VMs?

So this post started by mentioning that the Python implementation uses a computed goto in its bytecode interpreter. What about other VMs?

Ruby 1.9 (YARV): also uses computed goto.
Dalvik (the Android Java VM): computed goto
Lua 5.2: uses a switch
Finally, if you want to take a look at a simple, yet realistic VM, I invite you to examine the source code of Bobscheme - my own Scheme implementation. The "barevm" component (a bytecode interpreter in C++) uses a switch to do the dispatching.

Bonus: detailed disassembly of interp_switch

Here's an annotated disassembly of the interp_switch function. The code was compiled with gcc, enabling full optimizations (-O3).

0000000000400650 <interp_switch>:

Per the System V x64 ABI, "code" is in %rdi, "initval" is in %rsi,

the returned value is in %eax.

400650: 89 f0 mov %esi,%eax

This an other NOPx instructions are fillers used for aligning other

instructions.

400652: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)

This is the main entry to the loop.

If code[pc] <= 6, go to the jump table. Otherwise, proceed to return

from the function.

400658: 80 3f 06 cmpb $0x6,(%rdi) 40065b: 76 03 jbe 400660 <interp_switch+0x10>

Return. This also handles OP_HALT

40065d: f3 c3 repz retq 40065f: 90 nop

Put code[pc] in %edx and jump through the jump table according to

its value.

400660: 0f b6 17 movzbl (%rdi),%edx 400663: ff 24 d5 20 0b 40 00 jmpq *0x400b20(,%rdx,8) 40066a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)

Handle OP_ADD7

400670: 83 c0 07 add $0x7,%eax 400673: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)

pc++, and back to check the next opcode.

400678: 48 83 c7 01 add $0x1,%rdi 40067c: eb da jmp 400658 <interp_switch+0x8> 40067e: 66 90 xchg %ax,%ax

Handle OP_DIV2

400680: 89 c2 mov %eax,%edx 400682: c1 ea 1f shr $0x1f,%edx 400685: 8d 04 02 lea (%rdx,%rax,1),%eax 400688: d1 f8 sar %eax 40068a: eb ec jmp 400678 <interp_switch+0x28> 40068c: 0f 1f 40 00 nopl 0x0(%rax)

Handle OP_MUL2

400690: 01 c0 add %eax,%eax 400692: eb e4 jmp 400678 <interp_switch+0x28>

Handle OP_DEC

400694: 0f 1f 40 00 nopl 0x0(%rax) 400698: 83 e8 01 sub $0x1,%eax 40069b: eb db jmp 400678 <interp_switch+0x28> 40069d: 0f 1f 00 nopl (%rax)

Handle OP_INC

4006a0: 83 c0 01 add $0x1,%eax 4006a3: eb d3 jmp 400678 <interp_switch+0x28> 4006a5: 0f 1f 00 nopl (%rax)

Handle OP_NEG

4006a8: f7 d8 neg %eax 4006aa: eb cc jmp 400678 <interp_switch+0x28> 4006ac: 0f 1f 40 00 nopl 0x0(%rax)

How did I figure out which part of the code handles which opcode? Note that the "table jump" is done with:

jmpq *0x400b20(,%rdx,8)

This takes the value in %rdx, multiplies it by 8 and uses the result as an offset from 0x400b20. So the jump table itself is contained at address 0x400b20, which can be seen by examining the .rodata section of the executable:

$ readelf -x .rodata interp_compute_gotos

Hex dump of section '.rodata': 0x00400b00 01000200 00000000 00000000 00000000 ................ 0x00400b10 00000000 00000000 00000000 00000000 ................ 0x00400b20 5d064000 00000000 a0064000 00000000 ].@.......@..... 0x00400b30 98064000 00000000 90064000 00000000 ..@.......@..... 0x00400b40 80064000 00000000 70064000 00000000 ..@.....p.@..... 0x00400b50 a8064000 00000000 01010306 02020405 ..@.............

Reading the 8-byte values starting at 0x400b20, we get the mapping:

0x0 (OP_HALT) -> 0x40065d 0x1 (OP_INC) -> 0x4006a0 0x2 (OP_DEC) -> 0x400698 0x3 (OP_MUL2) -> 0x400690 0x4 (OP_DIV2) -> 0x400680 0x5 (OP_ADD7) -> 0x400670 0x6 (OP_NEG) -> 0x4006a8

Bonus: detailed disassembly of interp_cgoto

Similarly to the above, here is an annotated disassembly of the interp_cgoto function. I'll leave out stuff explained in the earlier snippet, trying to focus only on the things unique to the computed goto implementation.

00000000004006b0 <interp_cgoto>: 4006b0: 0f b6 07 movzbl (%rdi),%eax

Move the jump address indo %rdx from the jump table

4006b3: 48 8b 14 c5 e0 0b 40 mov 0x400be0(,%rax,8),%rdx 4006ba: 00 4006bb: 89 f0 mov %esi,%eax

Jump through the dispatch table.

4006bd: ff e2 jmpq *%rdx 4006bf: 90 nop

Return. This also handles OP_HALT

4006c0: f3 c3 repz retq 4006c2: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)

Handle OP_INC.

The pattern here repeats for handling other instructions as well.

The next opcode is placed into %edx (note that here the compiler

chose to access the next opcode by indexing code[1] and only later

doing code++.

Then the operation is done (here, %eax += 1) and finally a jump

through the table to the next instruction is performed.

4006c8: 0f b6 57 01 movzbl 0x1(%rdi),%edx 4006cc: 83 c0 01 add $0x1,%eax 4006cf: 48 8b 14 d5 e0 0b 40 mov 0x400be0(,%rdx,8),%rdx 4006d6: 00 4006d7: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1) 4006de: 00 00 4006e0: 48 83 c7 01 add $0x1,%rdi 4006e4: ff e2 jmpq *%rdx 4006e6: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1) 4006ed: 00 00 00

Handle OP_DEC

4006f0: 0f b6 57 01 movzbl 0x1(%rdi),%edx 4006f4: 83 e8 01 sub $0x1,%eax 4006f7: 48 8b 14 d5 e0 0b 40 mov 0x400be0(,%rdx,8),%rdx 4006fe: 00 4006ff: 48 83 c7 01 add $0x1,%rdi 400703: ff e2 jmpq *%rdx 400705: 0f 1f 00 nopl (%rax)

Handle OP_MUL2

400708: 0f b6 57 01 movzbl 0x1(%rdi),%edx 40070c: 01 c0 add %eax,%eax 40070e: 48 8b 14 d5 e0 0b 40 mov 0x400be0(,%rdx,8),%rdx 400715: 00 400716: 48 83 c7 01 add $0x1,%rdi 40071a: ff e2 jmpq *%rdx 40071c: 0f 1f 40 00 nopl 0x0(%rax)

Handle OP_DIV2

400720: 89 c2 mov %eax,%edx 400722: c1 ea 1f shr $0x1f,%edx 400725: 8d 04 02 lea (%rdx,%rax,1),%eax 400728: 0f b6 57 01 movzbl 0x1(%rdi),%edx 40072c: d1 f8 sar %eax 40072e: 48 8b 14 d5 e0 0b 40 mov 0x400be0(,%rdx,8),%rdx 400735: 00 400736: 48 83 c7 01 add $0x1,%rdi 40073a: ff e2 jmpq *%rdx 40073c: 0f 1f 40 00 nopl 0x0(%rax)

Handle OP_ADD7

400740: 0f b6 57 01 movzbl 0x1(%rdi),%edx 400744: 83 c0 07 add $0x7,%eax 400747: 48 8b 14 d5 e0 0b 40 mov 0x400be0(,%rdx,8),%rdx 40074e: 00 40074f: 48 83 c7 01 add $0x1,%rdi 400753: ff e2 jmpq *%rdx 400755: 0f 1f 00 nopl (%rax)

Handle OP_NEG

400758: 0f b6 57 01 movzbl 0x1(%rdi),%edx 40075c: f7 d8 neg %eax 40075e: 48 8b 14 d5 e0 0b 40 mov 0x400be0(,%rdx,8),%rdx 400765: 00 400766: 48 83 c7 01 add $0x1,%rdi 40076a: ff e2 jmpq *%rdx 40076c: 0f 1f 40 00 nopl 0x0(%rax)

Again, if we use readelf to look at address 0x400be0, we see the contents of the jump table, and infer the addresses which handle the various opcodes:

0x0 (OP_HALT) -> 0x4006c0 0x1 (OP_INC) -> 0x4006c8 0x2 (OP_DEC) -> 0x4006f0 0x3 (OP_MUL2) -> 0x400708 0x4 (OP_DIV2) -> 0x400720 0x5 (OP_ADD7) -> 0x400740 0x6 (OP_NEG) -> 0x400758

[1]	To the best of my knowledge, it's supported by other major compilers such as ICC and Clang, but not by Visual C++.

[2]	Note that the `while` loop here isn't really necessary because the looping is implicitly handled by the `goto` dispatching. I'm leaving it in just for visual consistency with the previous sample.

For comments, please send me an email.

Hacker Times

Hacker Times

Computed goto for efficient dispatch tables (2012)

Discussion

Discussion

Defining a simple bytecode VM

Computed gotos

The simple VM implemented with a computed goto

Benchmarking

Why is it faster?

Doing less per iteration

Branch prediction

What is done in other VMs?

Bonus: detailed disassembly of interp_switch

Per the System V x64 ABI, "code" is in %rdi, "initval" is in %rsi,

the returned value is in %eax.

This an other NOPx instructions are fillers used for aligning other

instructions.

This is the main entry to the loop.

If code[pc] <= 6, go to the jump table. Otherwise, proceed to return

from the function.

Return. This also handles OP_HALT

Put code[pc] in %edx and jump through the jump table according to

its value.

Handle OP_ADD7

pc++, and back to check the next opcode.

Handle OP_DIV2

Handle OP_MUL2

Handle OP_DEC

Handle OP_INC

Handle OP_NEG

Bonus: detailed disassembly of interp_cgoto

Move the jump address indo %rdx from the jump table

Jump through the dispatch table.

Return. This also handles OP_HALT

Handle OP_INC.

The pattern here repeats for handling other instructions as well.

The next opcode is placed into %edx (note that here the compiler

chose to access the next opcode by indexing code[1] and only later

doing code++.

Then the operation is done (here, %eax += 1) and finally a jump

through the table to the next instruction is performed.

Handle OP_DEC

Handle OP_MUL2

Handle OP_DIV2

Handle OP_ADD7

Handle OP_NEG