←All posts

Inside Bytecode VMs β€” From Interpreters to JIT Compilers

A deep dive into bytecode virtual machines like Python's CPython and JVM. We trace the entire journey from source code to bytecode compilation, explore stack machines vs register machines, interpreter implementation patterns, and finally JIT compiler optimizations that bring dynamic languages close to native speed.

VMPythonJavaJITRuntime

You've probably seen programming languages classified as either "compiled" or "interpreted." But most major languages today β€” Python, Java, C#, JavaScript, Ruby, Lua β€” actually sit somewhere in between, using a bytecode VM (virtual machine) architecture.

In this article, we'll take an exhaustive look at how bytecode VMs work. We'll trace the journey from source code to bytecode, explore how VMs execute bytecode, and dive deep into JIT compilers that optimize execution at runtime.

Execution Models β€” Compiled, Interpreted, and VM-Based

Let's start by classifying how programs are executed.

Native Compilation

Languages like C, C++, Rust, and Go compile source code directly into machine code (native code). The resulting binary is executed directly by the CPU, yielding the highest possible performance.

Source code (.c) β†’ Compiler β†’ Machine code (.exe) β†’ CPU executes directly

The upside is top-tier performance, but the binary is platform-specific β€” you need to recompile for each OS and architecture.

Pure Interpretation

Early BASIC and shell scripts work by reading and executing source code line by line.

Source code β†’ Interpreter reads and executes one line at a time

This is platform-independent and convenient, but parsing source code on every execution makes it the slowest approach.

Bytecode VM

This is the model used by Python, Java, C#, and many others. Source code is first compiled to an intermediate representation (bytecode), which is then executed by a virtual machine (VM).

Source code β†’ Compiler β†’ Bytecode β†’ VM executes

This approach offers several advantages:

  • Portability: Bytecode is platform-independent. It runs anywhere the VM exists (Java's "Write once, run anywhere")
  • Faster than pure interpretation: Source code is parsed only once. Bytecode is a compact, efficient format
  • Room for optimization: A JIT compiler can achieve near-native speed at runtime

The Compilation Pipeline β€” From Source Code to Bytecode

Let's trace the bytecode generation process, using Python as our example.

Lexical Analysis (Lexing / Tokenization)

The source code string is converted into a sequence of tokens (lexemes).

# Source code
x = 1 + 2 * 3
Token sequence:
  NAME("x") β†’ EQUAL("=") β†’ NUMBER("1") β†’ PLUS("+") 
  β†’ NUMBER("2") β†’ STAR("*") β†’ NUMBER("3") β†’ NEWLINE

The lexer skips whitespace and comments while splitting the source into the smallest meaningful units.

Parsing

The token sequence is transformed into an Abstract Syntax Tree (AST) β€” a tree structure representing the program's structure.

       Assign
       /    \
    Name   BinOp(+)
     "x"   /      \
        Num(1)   BinOp(*)
                 /      \
              Num(2)   Num(3)

The parser follows the language's grammar rules, correctly reflecting operator precedence (* before +) in the tree structure.

Semantic Analysis

Additional checks and information are attached to the AST:

  • Variable scope resolution (local, global, or closure capture)
  • Type checking (for statically typed languages)
  • Simple optimizations like constant folding

In Python, symbol table construction happens at this stage, determining which scope each variable belongs to.

Code Generation

The AST (or its intermediate representation) is traversed to generate bytecode instructions.

# Examining actual bytecode in Python
import dis
 
def example():
    x = 1 + 2 * 3
 
dis.dis(example)
  2           RESUME                   0
 
  3           LOAD_CONST               1 (7)
              STORE_FAST               0 (x)
              RETURN_CONST             0 (None)

Several things are worth noting:

  1. Constant folding: 1 + 2 * 3 has been precomputed to 7. CPython's compiler evaluates constant expressions at the AST stage
  2. LOAD_CONST / STORE_FAST: The bytecode is based on a stack machine model (more on this later)
  3. RETURN_CONST: Python functions implicitly return None when there's no explicit return

Let's look at a slightly more complex example:

def loop_example():
    total = 0
    for i in range(10):
        total += i
    return total
 
dis.dis(loop_example)
  2           RESUME                   0
 
  3           LOAD_CONST               1 (0)
              STORE_FAST               0 (total)
 
  4           LOAD_GLOBAL              1 (range)
              LOAD_CONST               2 (10)
              CALL                     1
              GET_ITER
        >>    FOR_ITER                 7 (to 36)
              STORE_FAST               1 (i)
 
  5           LOAD_FAST                0 (total)
              LOAD_FAST                1 (i)
              BINARY_OP               13 (+=)
              STORE_FAST               0 (total)
              JUMP_BACKWARD            9 (to 18)
 
        >>    END_FOR
              POP_TOP
 
  6           LOAD_FAST                0 (total)
              RETURN_VALUE

The for loop is translated into a GET_ITER β†’ FOR_ITER β†’ JUMP_BACKWARD bytecode pattern. This is the actual instruction sequence the VM executes.

The interactive demo below lets you step through bytecode execution one instruction at a time. Observe how the instruction pointer (IP), operand stack, and local variables change at each step.

Bytecode Execution Simulator

Step through bytecode execution of a function that starts with total=0, i=1 and repeats total += i; i += 1 while i < 3.

0β†’RESUME0
1β†’LOAD_CONST0 (0)
2β†’STORE_FAST0 (total)
3β†’LOAD_CONST1 (1)
4β†’STORE_FAST1 (i)
5β†’LOAD_FAST0 (total)
6β†’LOAD_FAST1 (i)
7β†’BINARY_OP+=
8β†’STORE_FAST0 (total)
9β†’LOAD_FAST1 (i)
10β†’LOAD_CONST2 (1)
11β†’BINARY_OP+
12β†’STORE_FAST1 (i)
13β†’LOAD_FAST1 (i)
14β†’LOAD_CONST3 (3)
15β†’COMPARE_OP<
16β†’POP_JUMP_IF_TRUE5
17β†’LOAD_FAST0 (total)
18β†’RETURN_VALUE
Operand Stack
(empty)
Local Variables
(none)
IP
0
1 / 26

Bytecode Structure

Bytecode is essentially a byte sequence of instruction (opcode) and operand pairs.

Opcodes and Operands

Bytecode instruction:
  [opcode] [operand]
  1 byte   0 to several bytes
 
Example (CPython 3.12+, base instruction = 1 word = 2 bytes; some have inline cache entries):
  LOAD_CONST    1     β†’ Push value at index 1 of constant table onto stack
  STORE_FAST    0     β†’ Store stack top into local variable slot 0
  BINARY_OP     0     β†’ Pop 2 values from stack, add, push result

Since CPython 3.6, all instructions use a word code format (2 bytes = 1 byte opcode + 1 byte operand). CPython 3.12 and later extended some instructions to 4 bytes. When an operand exceeds 255, an EXTENDED_ARG instruction is prepended to extend the operand.

Code Objects

In Python, compiled bytecode is stored as a code object.

def add(a, b):
    return a + b
 
code = add.__code__
print(code.co_code)        # Raw byte sequence
print(code.co_consts)      # Constant table: (None,)
print(code.co_varnames)    # Local variable names: ('a', 'b')
print(code.co_stacksize)   # Required stack depth: 2
print(code.co_nlocals)     # Number of local variables: 2

Code objects contain not just the instruction sequence but all metadata the VM needs: constant tables, variable name tables, stack size information, and more.

.pyc Files β€” Bytecode Caching

Python caches compiled bytecode as .pyc files when importing modules.

__pycache__/
  module.cpython-312.pyc

.pyc file format:

OffsetSizeContent
04 bytesMagic number (identifies Python version)
44 bytesBit field (invalidation mode flags)
88 bytesSource file timestamp or hash
16onwardMarshalled code object

On subsequent imports, if the .pyc is up to date, Python skips recompilation and loads the bytecode directly, reducing startup time.

Stack Machines vs Register Machines

VM architectures fall into two major categories: stack machines and register machines.

Stack Machines

This is the model used by CPython, JVM, and CLR. All operations go through an operand stack.

Expression: a + b * c bytecode (stack machine)
 
Instruction       Stack state (left = bottom)
────────────────────────────────────────
LOAD a           [a]
LOAD b           [a, b]
LOAD c           [a, b, c]
MUL              [a, (b*c)]
ADD              [(a+b*c)]

Pros:

  • Compact instructions (no need to specify register numbers in operands)
  • Easy code generation (direct post-order AST traversal)
  • Simple instruction set makes VM implementation straightforward

Cons:

  • More instructions needed for the same computation (frequent LOAD/STORE)
  • Stack reads/writes become memory accesses with associated overhead

Register Machines

This model is used by Lua (5.0+), Dalvik VM (Android), and some JS engines. Operations use an array of virtual registers.

Expression: a + b * c bytecode (register machine)
 
Instruction                Register state
────────────────────────────────────────
MUL  R1, R_b, R_c         R1 = b * c
ADD  R0, R_a, R1           R0 = a + b * c

Pros:

  • Fewer instructions (one instruction can specify multiple operands)
  • Less data movement, reducing interpreter dispatch overhead

Cons:

  • Larger instructions (register numbers require extra bits in operands)
  • More complex code generation (register allocation needed)

Real-World VM Choices

VMArchitectureLanguage
CPythonStack machinePython
JVMStack machineJava, Kotlin, Scala, Clojure
CLRStack machineC#, F#, VB.NET
Lua VMRegister machineLua
DalvikRegister machineJava (Android, replaced by ART in Android 5.0)
BEAMRegister machine (hybrid)Erlang, Elixir

Research has shown that register machines can reduce instruction dispatch count by 47% compared to stack machines (Yunhe Shi et al., "Virtual Machine Showdown: Stack Versus Registers", VEE 2005 / ACM TACO 2008). However, bytecode size is roughly 26% larger for register machines.

The demo below executes the same expression a + b Γ— c on a stack machine and register machine simultaneously, so you can feel the difference in instruction count.

Stack Machine vs Register Machine

Compare stack and register machine execution side by side, computing a + b Γ— c.

Stack Machine(6 instructions)
LOAD a
LOAD b
LOAD c
MUL
ADD
STORE result
Stack
(empty)
Register Machine(3 instructions)
MUL R1, b, c
ADD R0, a, R1
MOV result, R0
Registers
a2
b3
c4
R0-
R1-
1 / 7

Interpreter Implementation Patterns

The interpreter β€” the core of the VM that executes bytecode β€” can be implemented using several patterns.

Switch-based Interpreter

The simplest implementation. A giant switch statement handles each opcode.

// Simplified VM loop (C language)
void eval(uint8_t *bytecode, Value *stack) {
    uint8_t *ip = bytecode;   // Instruction Pointer
    Value *sp = stack;         // Stack Pointer
 
    for (;;) {
        uint8_t opcode = *ip++;
        switch (opcode) {
            case OP_LOAD_CONST: {
                uint8_t idx = *ip++;
                *sp++ = constants[idx];
                break;
            }
            case OP_ADD: {
                Value b = *--sp;
                Value a = *--sp;
                *sp++ = value_add(a, b);
                break;
            }
            case OP_STORE_FAST: {
                uint8_t slot = *ip++;
                locals[slot] = *--sp;
                break;
            }
            case OP_RETURN: {
                return;
            }
            // ... hundreds of opcodes
        }
    }
}

Pros: Simple to implement and understand. The historical core of CPython.

Cons: Poor CPU branch prediction. All opcodes share the same switch branch point, so the branch predictor can't predict "which opcode comes next," causing frequent pipeline stalls.

Direct Threaded Interpreter

Uses GCC's computed goto extension to jump directly to each opcode handler's address.

void eval_threaded(uint8_t *bytecode, Value *stack) {
    // Address table for each handler
    static void *dispatch_table[] = {
        [OP_LOAD_CONST] = &&op_load_const,
        [OP_ADD]        = &&op_add,
        [OP_STORE_FAST] = &&op_store_fast,
        [OP_RETURN]     = &&op_return,
        // ...
    };
 
    uint8_t *ip = bytecode;
    Value *sp = stack;
 
    #define DISPATCH() goto *dispatch_table[*ip++]
 
    DISPATCH();  // Jump to first instruction
 
op_load_const: {
        uint8_t idx = *ip++;
        *sp++ = constants[idx];
        DISPATCH();
    }
op_add: {
        Value b = *--sp;
        Value a = *--sp;
        *sp++ = value_add(a, b);
        DISPATCH();
    }
op_store_fast: {
        uint8_t slot = *ip++;
        locals[slot] = *--sp;
        DISPATCH();
    }
op_return:
        return;
}

Pros: Since each handler ends with a different indirect jump, the CPU branch predictor can maintain independent entries per handler. Reported 15-25% speedup over switch-based implementations.

Cons: Depends on GCC/Clang extensions (not standard C). Not supported by MSVC.

CPython 3.x detects at compile time whether computed goto is available and uses direct threading when it is.

Token Threaded / Subroutine Threaded

Beyond direct threading, there are other threading approaches:

  • Token Threaded: Keeps bytecode as opcode indices and looks up the dispatch table at dispatch time. Produces more compact bytecode than direct threading
  • Subroutine Threaded: Implements each opcode handler as a function and converts the bytecode sequence into an array of function pointers. Uses call/ret instructions, which benefit from the CPU's Return Stack Buffer (RSB) for branch prediction

Specializing Adaptive Interpreter

A groundbreaking optimization introduced in CPython 3.11. It specializes bytecode instructions based on runtime data type patterns.

Generic instruction:     BINARY_OP  ADD
                           ↓ Runtime detects "int + int" pattern
Specialized instruction: BINARY_OP_ADD_INT

How it works:

  1. Each instruction gets an adaptive counter
  2. After an instruction executes a certain number of times, the VM analyzes recent operand type patterns
  3. If the pattern is stable, the generic instruction is rewritten to a more efficient specialized instruction
  4. If an unexpected type appears, deoptimization reverts to the generic instruction

Key specialized instructions (CPython 3.12+):

Generic InstructionSpecialized InstructionCondition
LOAD_ATTRLOAD_ATTR_INSTANCE_VALUEInstance attribute dict access
LOAD_ATTRLOAD_ATTR_SLOTClass using __slots__
LOAD_ATTRLOAD_ATTR_MODULEModule attribute
BINARY_OPBINARY_OP_ADD_INTBoth operands are int
BINARY_OPBINARY_OP_ADD_FLOATBoth operands are float
BINARY_OPBINARY_OP_ADD_UNICODEBoth operands are str
COMPARE_OPCOMPARE_OP_INTInteger comparison
CALLCALL_PY_EXACT_ARGSPython function call (exact arg count)
FOR_ITERFOR_ITER_LISTList iteration
UNPACK_SEQUENCEUNPACK_SEQUENCE_TWO_TUPLE2-element tuple unpack

This optimization made CPython 3.11 on average 25% faster than 3.10. Achieving this level of improvement through interpreter-only techniques β€” without JIT β€” was remarkable.

Frames and the Call Stack

Let's examine how VMs manage function calls.

Frame Objects

Each time a function is called, the VM creates a frame (stack frame / execution frame).

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚           Frame                 β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Code object (code)             β”‚  β†’ Bytecode instructions
β”‚  Instruction pointer (IP)       β”‚  β†’ Current execution position
β”‚  Local variable array (locals)  β”‚  β†’ [a, b, result, ...]
β”‚  Operand stack                  β”‚  β†’ Temporary stack for operations
β”‚  Link to previous frame         β”‚  β†’ Caller's frame
β”‚  Exception handler stack        β”‚  β†’ try/except info
β”‚  Builtins namespace reference   β”‚  β†’ builtins
β”‚  Global namespace reference     β”‚  β†’ globals
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Call Stack Growth

def outer():
    x = 10
    return inner(x)
 
def inner(n):
    return n * 2
Runtime call stack:
 
inner(10) frame  ← Currently executing
  code: inner.__code__
  locals: [n=10]
  stack: []
──────────────────
outer() frame
  code: outer.__code__
  locals: [x=10]
  stack: []
──────────────────
<module> frame
  code: <module>.__code__
  locals: []
  stack: []

When inner executes return, its frame is discarded and control returns to outer's frame.

CPython limits the frame stack depth via sys.getrecursionlimit() (default 1000). This is a safety valve to prevent segmentation faults from stack overflow.

Frame Optimizations

Frame creation and destruction are overhead, so VMs apply various optimizations:

  • CPython 3.11+: Inlined frames (_PyInterpreterFrame). Avoids heap allocation by placing frames directly on the C stack
  • JVM: Frames are placed on the native stack. The JIT compiler may completely eliminate function calls through inlining
  • V8: The JIT compiler merges frames via inlining and optimizes stack frame layout for deoptimization metadata

Object Representation and Tagged Values

How does a VM internally represent values in a dynamically typed language? This directly affects VM performance.

Boxing

In CPython, every value is represented as a pointer to a PyObject struct. Even the integer 42 is a heap-allocated object.

// CPython's PyObject struct (simplified)
typedef struct {
    Py_ssize_t ob_refcnt;    // Reference count
    PyTypeObject *ob_type;   // Pointer to type
} PyObject;
 
// Python integers are arbitrary precision, so they use a variable-length digit array
typedef struct {
    PyObject_VAR_HEAD            // ob_refcnt + ob_type + ob_size
    digit ob_digit[1];           // Arbitrary-precision digit array (30 bits/digit)
} PyLongObject;
// Note: CPython 3.12+ introduced a "compact" representation for small ints,
//   storing the value in ob_size bit fields and omitting ob_digit

Python's int is an arbitrary-precision integer that can represent values of any size. Even a small integer like 42 requires: reference count (8 bytes) + type pointer (8 bytes) + size (8 bytes) + digit array + memory allocator overhead = at least 28 bytes. Compare this to C's int at just 4 bytes β€” quite a luxurious use of memory.

NaN Boxing

Some VMs exploit the IEEE 754 NaN (Not a Number) bit pattern to store both a type tag and value within a 64-bit double.

IEEE 754 double (64 bits):
  sign(1) | exponent(11) | mantissa(52)
 
NaN bit pattern:
  x | 11111111111 | 1xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
                     ↑ quiet NaN bit
 
Value storage:
  - Floating point: stored directly as IEEE 754 bit pattern
  - Integers, pointers, booleans: stored in NaN mantissa with tag bits + value
NaN Boxing layout example:
  [0][11111111111][1][TTT][PPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPP]
                       ↑ tag (3 bits)  ↑ payload (48 bits)
 
Tags:
  000 = pointer (object)
  001 = 32-bit integer
  010 = boolean
  011 = nil/null
  ...

LuaJIT, JavaScriptCore (Safari), and SpiderMonkey (Firefox) use this technique. In SpiderMonkey, NaN boxing is the primary value representation used throughout the engine. Since values fit in a 64-bit word on the stack, heap allocation and pointer indirection are eliminated, yielding significant performance gains.

Tagged Pointers

On 64-bit architectures, alignment ensures that the lower few bits of pointers are always zero. These spare bits can be used as type tags.

Normal pointer (8-byte alignment):
  PPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPP000
                                                                ↑ always 0 (lower 3 bits)
 
Tagged pointer:
  PPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPP|TTT|
  ↑ pointer portion (upper 61 bits)                             ↑ tag (lower 3 bits)

V8 (Google's JavaScript engine) uses tagged pointers to distinguish Smi (Small Integer) from HeapObject:

  • Least significant bit is 0 β†’ Smi (31-bit integer stored in upper 31 bits)
  • Least significant bit is 1 β†’ HeapObject pointer (mask off tag bit to dereference)

CRuby (MRI) uses a similar technique, representing Fixnum (small integers) without heap allocation.

JIT Compilers β€” Runtime Native Code Generation

This is the heart of the article. A JIT (Just-In-Time) compiler compiles bytecode to native machine code during program execution, dramatically speeding up subsequent execution by running native code directly.

Why JIT Is Needed

When interpreting bytecode, dozens to hundreds of native instructions execute for each bytecode instruction.

Bytecode: BINARY_OP_ADD_INT
 
What the interpreter actually does (outline):
  1. Fetch opcode (memory read)
  2. Get handler address from dispatch table
  3. Jump to handler
  4. Pop operand B from stack
  5. Pop operand A from stack
  6. Type check A (is it int?)
  7. Type check B (is it int?)
  8. Overflow check
  9. Perform addition
  10. Push result onto stack
  11. Advance to next opcode
 
When JIT compiles to native code:
  add rax, rbx    ← Just 1 instruction (or a few)

For integer addition, the interpreter incurs tens of times more overhead. JIT eliminates this overhead.

JIT Compilation Flow

  Source code
      ↓
  Compile to bytecode
      ↓
  Start executing in interpreter
      ↓
  Profiling (collect type info, execution frequency)
      ↓
  Hot spot detection (frequently executed code)
      ↓
  JIT compile (bytecode β†’ IR β†’ optimization β†’ native code)
      ↓
  Execute native code (bypass interpreter on subsequent runs)
      ↓
  If assumptions break, deoptimize (native code β†’ back to interpreter)

Method JIT vs Tracing JIT

JIT compilers fall into two major approaches.

Method JIT

Compiles at the function (method) level. JVM (HotSpot C2), V8, and .NET's RyuJIT use this approach.

Function calculate() was called 1000 times
  β†’ Identified as "hot method"
  β†’ Compile entire function to native code
  β†’ Execute native code on subsequent calls

Pros: Clear compilation units make optimization straightforward. Pairs well with inlining. Cons: Compiling large functions takes time. Even if only part of a function is hot, the entire function is compiled.

Tracing JIT

Traces (records) execution paths and compiles the actual executed bytecode sequence. LuaJIT and the former Mozilla TraceMonkey used this approach.

Record loop execution path:
  LOAD x β†’ LOAD y β†’ ADD β†’ COMPARE β†’ BRANCH(true) β†’ STORE β†’ JUMP_BACK
 
Compile this "trace" to native code:
  - Assume branch is always true (verified by guard instructions)
  - Assume types match recorded types (verified by guard instructions)
  - Fall back to interpreter if a guard fails

Pros: Extremely effective for loop optimization. Only the actually executed path is compiled, eliminating dead code. Cross-function inlining happens naturally. Cons: Weak against complex branching patterns (frequent deoptimization from guard failures). Trace recording has overhead.

Type Specialization

The most important JIT optimization for dynamically typed languages is type specialization.

def add(a, b):
    return a + b

This function can accept int, float, str, list, or any other type. But if profiling reveals "this function is always called with int," the JIT can specialize:

Generic version (interpreter):
  1. Get type of a
  2. Get type of b
  3. Look up __add__ method for the type
  4. Call __add__
  5. Box result and return
 
Specialized version (JIT, assuming int):
  ; Guard: verify a is int
  test [rdi + type_offset], INT_TAG
  jne deoptimize
  ; Guard: verify b is int
  test [rsi + type_offset], INT_TAG
  jne deoptimize
  ; Direct addition
  mov rax, [rdi + value_offset]
  add rax, [rsi + value_offset]
  ; Overflow check
  jo  deoptimize
  ret

Type checking overhead is minimized through guard instructions. If the guard succeeds, the optimized path runs; if it fails, deoptimization returns to the interpreter.

Inlining

One of the most powerful JIT compiler optimizations. It replaces function calls with the called function's body.

def square(x):
    return x * x
 
def sum_of_squares(a, b):
    return square(a) + square(b)

The JIT compiler transforms this into:

sum_of_squares native code (after inlining):
  ; Inline square(a)
  mov rax, [rdi]      ; value of a
  imul rax, rax       ; a * a
  mov rcx, rax        ; save result
 
  ; Inline square(b)
  mov rax, [rsi]      ; value of b
  imul rax, rax       ; b * b
 
  ; Addition
  add rax, rcx        ; a*a + b*b
  ret

Benefits of inlining:

  • Function call/return overhead is eliminated
  • Further optimizations (constant propagation, dead code elimination) can be applied to the inlined code
  • Optimizations can use the caller's context (type information)

Deoptimization

When assumptions made by the JIT are violated during execution, native code execution must be suspended and control returned to the interpreter. This is deoptimization.

def process(x):
    return x + 1
 
# First 10,000 calls with int β†’ JIT generates int version
for i in range(10000):
    process(i)
 
# Suddenly called with float β†’ guard fails, deoptimization occurs
process(3.14)

The deoptimization process:

  1. Guard instruction detects failure
  2. Capture current machine register state
  3. Reconstruct interpreter frame (stack, local variables, instruction pointer)
  4. Return control to interpreter
  5. Update profile information and set recompilation trigger

Deoptimization is very expensive, but if it's rare enough, overall JIT performance still improves. The problem is when deoptimization happens frequently β€” the compile β†’ deoptimize β†’ recompile cycle can make things slower.

On-Stack Replacement (OSR)

Normally, JIT compilation takes effect on a function's next invocation. But for long-running loops, you can't wait until the function is called again.

OSR is a technique that replaces the currently executing interpreter frame with JIT-compiled native code mid-execution.

Running loop:
  for i in range(1000000):  ← During loop iteration...
      ...                    
                             β˜… JIT compilation complete!
                             β˜… OSR: Switch from interpreter to native code
                             β˜… Remaining loop iterations run in native code

What OSR requires:

  1. Capture interpreter frame state (local variables, stack, loop counter)
  2. Map to native code frame layout
  3. Jump to the appropriate position in the native code loop body

JVM's HotSpot, V8, and SpiderMonkey all support OSR.

Multi-Tier JIT Compilation

JIT compilation involves a tradeoff between compilation time and optimization level. Compile quickly with low optimization, or spend time for high optimization β€” it's a dilemma.

Most practical JITs solve this with tiered compilation.

The interactive demo below shows how functions progress from interpreter to baseline JIT to optimizing JIT as they execute. You can also experience deoptimization and recompilation.

Multi-Tier JIT Compilation in Action

Watch 3 functions progress from interpreter β†’ baseline JIT β†’ optimizing JIT as they execute. Experience deoptimization and recompilation.

Interpreter
Baseline JIT
Optimizing JIT
Deoptimized!
calculate()
Interpreter
Call count: 0
helper()
Interpreter
Call count: 0
format()
Interpreter
Call count: 0
1 / 8

JVM HotSpot's Tiered Compilation

JVM's HotSpot has 5 tiers, but methods don't necessarily pass through all of them sequentially. The most common path is Tier 0 β†’ Tier 3 β†’ Tier 4 (interpreter β†’ C1 with full profiling β†’ C2):

Tier 0: Interpreter (collects profiling data)
  ↓ Method call count exceeds threshold (this is typically the first compilation)
Tier 3: C1, full profiling
  ↓ Sufficient profile data accumulated
Tier 4: C2 (Server compiler), aggressive optimization
 
Tiers used in special cases:
Tier 1: C1, no profiling (for trivial methods)
Tier 2: C1, limited profiling (fallback when C2 queue is full)
TierCompilerCompile SpeedOptimization LevelProfiling
0None (interpreter)-NoneYes
1C1FastBasicNo
2C1FastBasicLimited
3C1FastBasicFull
4C2SlowAggressive-

C1 (Client compiler) performs quick, basic optimizations for applications where short startup time matters. C2 (Server compiler) spends more time on aggressive optimizations for long-running server applications.

Key C2 optimizations:

  • Loop unrolling
  • Escape analysis for allocation elimination
  • Lock elision
  • Bounds check elimination
  • Dead code elimination
  • Vectorization (SIMD instructions)

V8's Tiered Compilation

Google's V8 JavaScript engine also uses tiered compilation.

Ignition (interpreter)
  ↓ Function becomes hot
Sparkplug (baseline compiler, V8 v9.1+)
  ↓ Further profiling
Maglev (mid-tier compiler, Chrome 117+ / V8 v11.7+)
  ↓ Sufficient profile data
TurboFan (optimizing compiler)
  • Ignition: Register-based bytecode interpreter. All code starts here
  • Sparkplug: Non-optimizing compiler that generates native code directly from bytecode. Extremely fast compilation (bypasses AST)
  • Maglev (Chrome 117+ / V8 v11.7+): Mid-tier compiler using SSA-based intermediate representation with type-feedback-based optimization
  • TurboFan: Full optimizing compiler using Sea of Nodes IR. Performs inlining, loop optimization, escape analysis, and more

GraalVM's Graal JIT

Oracle's GraalVM ships with a JIT compiler (Graal) that is implemented in Java itself.

Notable features:

  • Partial Escape Analysis: When an object escapes only on certain paths, allocation happens only on those paths β€” on other paths, it's kept as scalar values
  • Polyglot support (Truffle framework): Write an AST interpreter using Truffle's API (@Specialization, @NodeChild, etc.), and Graal JIT automatically applies optimization. Supports Python, Ruby, JavaScript, R, LLVM bitcode, and more
  • Native Image: AOT (Ahead-Of-Time) compilation produces standalone binaries with near-instant startup time (on the order of milliseconds)

Major VM Comparison

Let's compare today's major bytecode VMs.

CPython (Python)

Features:
  - Stack-based bytecode VM
  - GIL (Global Interpreter Lock) limits multi-threading
  - Reference counting + cycle detection GC
  - CPython 3.11: Specializing Adaptive Interpreter
  - CPython 3.13: Experimental JIT (copy-and-patch JIT)
  - CPython 3.13: Free-threaded mode (GIL disabled, experimental)

CPython 3.13's copy-and-patch JIT is an ultra-lightweight JIT that copies pre-prepared native code templates (stencils) and patches operands. It minimizes compilation time while eliminating interpreter dispatch overhead.

Importantly, the copy-and-patch JIT does not compile bytecodes directly. Instead, the Tier 2 optimizer first converts bytecodes to a micro-operation (ΞΌop) intermediate representation, and the JIT is applied to that ΞΌop sequence.

Copy-and-patch flow:
  1. At build time, generate native code templates (stencils) for each ΞΌop
  2. At runtime, the Tier 2 optimizer converts bytecodes to ΞΌop sequences
  3. Copy templates for hot ΞΌop sequences into memory
  4. Patch operands (constant values, variable slot numbers)
  5. Concatenate templates to form complete native code sequence
  6. Execute the concatenated template sequence directly

JVM (Java, Kotlin)

Features:
  - Stack-based bytecode VM
  - Tiered JIT (C1 + C2, or Graal)
  - Advanced GC (G1, ZGC, Shenandoah)
  - Bytecode verification for security guarantees
  - Dynamic class loading via ClassLoaders

The JVM bytecode verifier checks before execution:

  • No stack overflow/underflow
  • Local variable access is within bounds
  • Type consistency is maintained
  • finally blocks are properly connected

This prevents execution of malformed bytecode and guarantees VM memory safety.

CLR (C#, F#)

Features:
  - Stack-based CIL (Common Intermediate Language)
  - RyuJIT (JIT compiler)
  - Value types (struct) β€” avoids boxing
  - Span<T> / stackalloc for zero-allocation
  - NativeAOT for ahead-of-time compilation
  - Generational mark-and-sweep GC

CLR's standout feature is value type support. struct values are placed directly on the stack, significantly reducing heap allocation and GC pressure.

// Value type β€” placed on stack
struct Point { public int X, Y; }
 
// Reference type β€” heap allocated
class PointClass { public int X, Y; }
 
void Example() {
    Point p = new(1, 2);       // On the stack. No GC needed
    PointClass pc = new(1, 2); // Heap allocated. GC managed
}

LuaJIT

Features:
  - Register-based bytecode VM
  - Tracing JIT
  - NaN Boxing for value representation
  - Extremely fast (among the fastest dynamic language VMs)
  - Fast FFI (Foreign Function Interface)

LuaJIT, developed by Mike Pall, is one of the fastest tracing JITs in existence. Despite its compact codebase (~100K lines of C/assembly), it achieves near-C performance on many benchmarks.

How LuaJIT's tracing JIT works:

  1. Hot loop detection: A loop's back-edge execution count exceeding a threshold marks it as "hot"
  2. Trace recording: Records the hot loop's execution path as a "trace." Function calls are unfolded into the trace
  3. SSA conversion: Converts the trace to SSA (Static Single Assignment) form IR
  4. Optimization: Constant folding, algebraic simplification, CSE (Common Subexpression Elimination), DCE (Dead Code Elimination), loop-invariant code motion, register allocation
  5. Code generation: Generates native code directly from optimized IR (supports x86, x64, ARM, MIPS, PPC)

V8 (JavaScript)

Features:
  - Bytecode (Ignition) + tiered JIT
  - Hidden Classes for efficient object representation
  - Inline Caches (IC) for fast property access
  - Generational GC (Scavenger + Mark-Compact)
  - Tagged pointers

V8's Hidden Class (Map) system infers fixed layouts β€” similar to C++ classes β€” for dynamically typed objects.

// These two objects share the same Hidden Class
let a = { x: 1, y: 2 };
let b = { x: 3, y: 4 };
 
// This triggers a transition to a new Hidden Class
a.z = 5;

Objects with properties added in the same order share the same Hidden Class, allowing the JIT compiler to optimize property access based on it.

Concrete Example: Python Bytecode Execution Trace

Finally, let's trace exactly how Python bytecode is executed, step by step.

def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)
import dis
dis.dis(fibonacci)
  2           RESUME                   0
 
  3           LOAD_FAST                0 (n)
              LOAD_CONST               1 (1)
              COMPARE_OP              26 (<=)
              POP_JUMP_IF_FALSE        1 (to 14)
 
  4           LOAD_FAST                0 (n)
              RETURN_VALUE
 
  5     >>    LOAD_GLOBAL              1 (fibonacci)
              LOAD_FAST                0 (n)
              LOAD_CONST               1 (1)
              BINARY_OP               10 (-)
              CALL                     1
              LOAD_GLOBAL              1 (fibonacci)
              LOAD_FAST                0 (n)
              LOAD_CONST               2 (2)
              BINARY_OP               10 (-)
              CALL                     1
              BINARY_OP                0 (+)
              RETURN_VALUE

Let's trace the execution of fibonacci(3), showing the stack changes:

fibonacci(3) execution trace:
 
Instruction             Stack              Notes
─────────────────────────────────────────────────────
RESUME                  []                 Frame initialization
LOAD_FAST 0 (n)         [3]                Push n=3
LOAD_CONST 1 (1)        [3, 1]             Push constant 1
COMPARE_OP <= (26)      [False]            3 <= 1 β†’ False
POP_JUMP_IF_FALSE       []                 False, so jump
LOAD_GLOBAL 1           [NULL, fibonacci]  Push NULL + function object
LOAD_FAST 0 (n)         [NULL, fibonacci, 3]  n=3
LOAD_CONST 1 (1)        [NULL, fibonacci, 3, 1]  Constant 1
BINARY_OP - (10)        [NULL, fibonacci, 2]  3 - 1 = 2
CALL 1                  [β†’ calls fibonacci(2)]
                        ... recursive call ...
                        [result_1]          fibonacci(2) result = 1
LOAD_GLOBAL 1           [1, NULL, fibonacci]
LOAD_FAST 0 (n)         [1, NULL, fibonacci, 3]
LOAD_CONST 2 (2)        [1, NULL, fibonacci, 3, 2]
BINARY_OP - (10)        [1, NULL, fibonacci, 1]  3 - 2 = 1
CALL 1                  [β†’ calls fibonacci(1)]
                        ... recursive call ...
                        [1, result_2]       fibonacci(1) result = 1
BINARY_OP + (0)         [2]                1 + 1 = 2
RETURN_VALUE            β†’ returns 2

Inline Caching β€” Speeding Up Dynamic Dispatch

In dynamically typed languages, every property access or method call requires checking the object's type and looking up the property's location. Inline Caching (IC) caches these lookup results for reuse.

IC State Transitions

Uninitialized
  ↓ First access
Monomorphic
  β†’ Only one type observed. Highest cache hit rate
  ↓ Different type observed
Polymorphic
  β†’ 2-4 types observed. Linear search through multiple cache entries
  ↓ More types observed
Megamorphic
  β†’ Too many types observed. Cache abandoned, falls back to generic path

Concrete Example

function getX(obj) {
    return obj.x;  // ← IC is set up at this call site
}
 
// Always called with same-shape objects β†’ Monomorphic (fastest)
getX({ x: 1, y: 2 });
getX({ x: 3, y: 4 });
 
// Mixed object shapes β†’ Polymorphic β†’ Megamorphic
getX({ x: 1 });
getX({ x: 1, y: 2, z: 3 });
getX({ a: 0, x: 1 });

A monomorphic IC can speed up property access to one comparison + one memory read. JIT compilers use monomorphic IC information to generate direct field access code with type guards.

Profiling Mechanics

For a JIT compiler to decide "which code to compile" and "which types to specialize for," profiling data collection is essential.

Profiling Approaches Compared

VMProfiling MethodMechanism
JVM HotSpotInvocation counter + backedge counterCounts method invocations and loop backedge executions. Decays over time to prioritize recent hot spots
V8Interrupt BudgetDecrements a budget per bytecode instruction. Triggers compilation when budget reaches 0
CPython 3.11+Per-instruction adaptive counterAttaches a counter to each bytecode instruction. Specialization is attempted when threshold is reached

HotSpot's counter decay is an important design decision. It prevents code that ran frequently during startup β€” but rarely during steady state β€” from being considered "hot" indefinitely.

GC–JIT Coordination β€” Safepoints

JIT-compiled native code must cooperate with the garbage collector. For the GC to scan and relocate heap objects, all threads must stop at a safe state (safepoint).

Safepoint Placement

JIT compilers insert safepoint polls into generated native code.

JIT-generated code (outline):
  ...
  add rax, rbx          ; computation
  test [safepoint_page], eax  ; β˜… safepoint poll
  ...                    ; when safepoint_page is made unreadable,
                         ; a page fault occurs β†’ control transfers to GC

In JVM HotSpot, safepoints are placed at:

  • Just before method return
  • Loop backedges (jumps from end to start of loop)
  • JNI / native method call transitions

For counted loops (loops with an int counter and a known bound), backedge safepoints may be omitted, which can cause unexpectedly long GC pauses (Time-To-Safepoint issues). Since JDK 17, loop strip mining is enabled by default, breaking counted loops into chunks with safepoint polls between them to mitigate this problem.

JIT Security β€” The W^X Problem

JIT compilation poses an inherent security challenge. Generating native code at runtime requires memory to be both writable (W) and executable (X), violating the W^X (Write XOR Execute) security policy.

Addressing the W^X Problem

ApproachDescriptionUsed By
W/X toggleMemory is W during code generation, then switched to XMost JITs
Double mappingMap the same physical page as W (for writing) and X (for execution) at two virtual addressesOpenJDK, V8
JIT prohibitionNo JIT β€” use only AOT or interpreteriOS apps (except Safari)

On iOS, Apple prohibits JIT for third-party apps. Safari and WKWebView (including third-party browsers like Chrome and Firefox) have JIT enabled, but apps that embed JavaScriptCore directly run with only the interpreter (LLInt) and no JIT tiers. This can impact JavaScript performance on iOS.

Conclusion

Bytecode VMs are an elegant architecture that balances portability with execution efficiency.

Source Code
  ↓ Lexing β†’ Parsing β†’ Semantic Analysis β†’ Code Generation
Bytecode (portable intermediate representation)
  ↓ VM executes
Interpreter (executes all code first)
  ↓ Profiling
JIT Compiler (converts hot spots to native code)
  ↓ Type specialization β†’ Inlining β†’ Various optimizations
Native Code (CPU executes directly)

Each VM makes different tradeoffs:

  • CPython: Prioritizes simplicity and ecosystem stability. JIT only recently introduced
  • JVM: A heavyweight JIT pipeline optimized for long-running servers
  • V8: Multi-tier compilation balancing short script startup with server-side optimization
  • LuaJIT: A tracing JIT that extracts remarkable performance from a compact codebase

We may never reach the ideal VM β€” zero overhead, infinitely fast β€” but through interpreter improvements, JIT compiler evolution, and hardware co-design, the gap keeps narrowing. CPython's rapid acceleration in the 2020s (3.11's Specializing Adaptive Interpreter, 3.13's copy-and-patch JIT) demonstrates that bytecode VM technology is still very much evolving. Meanwhile, WebAssembly (WASM) is bringing bytecode VM design principles beyond the browser into server-side and edge computing.

References

  • Aho, Lam, Sethi, Ullman. Compilers: Principles, Techniques, and Tools (2nd Edition), 2006
  • Smith & Nair. Virtual Machines: Versatile Platforms for Systems and Processes, 2005
  • Yunhe Shi, David Gregg, Andrew Beatty, M. Anton Ertl. "Virtual Machine Showdown: Stack Versus Registers", ACM Transactions on Architecture and Code Optimization, 2008
  • CPython Developer's Guide. https://devguide.python.org/internals/
  • Mark Shannon. "Faster CPython" (PEP 659 β€” Specializing Adaptive Interpreter)
  • Mike Pall. "LuaJIT 2.0 intellectual property disclosure and research opportunities", 2009
  • "Life of a V8 JavaScript object", V8 Blog. https://v8.dev/blog
  • WΓΌrthinger et al. "One VM to Rule Them All" (Truffle/Graal), ACM DL, 2013
  • Bryce Adelstein Lelbach. "A Deep Introduction to JIT Compilers: JITs are not very Just-in-Time", CppCon 2021
  • HΓΆlzle, Chambers, Ungar. "Optimizing Dynamically-Typed Object-Oriented Languages With Polymorphic Inline Caches", ECOOP, 1991
  • Agesen. "Design and Implementation of PEP, a Java Just-In-Time Translator", Sun Microsystems, 1997