Inside Bytecode VMs β From Interpreters to JIT Compilers
A deep dive into bytecode virtual machines like Python's CPython and JVM. We trace the entire journey from source code to bytecode compilation, explore stack machines vs register machines, interpreter implementation patterns, and finally JIT compiler optimizations that bring dynamic languages close to native speed.
You've probably seen programming languages classified as either "compiled" or "interpreted." But most major languages today β Python, Java, C#, JavaScript, Ruby, Lua β actually sit somewhere in between, using a bytecode VM (virtual machine) architecture.
In this article, we'll take an exhaustive look at how bytecode VMs work. We'll trace the journey from source code to bytecode, explore how VMs execute bytecode, and dive deep into JIT compilers that optimize execution at runtime.
Execution Models β Compiled, Interpreted, and VM-Based
Let's start by classifying how programs are executed.
Native Compilation
Languages like C, C++, Rust, and Go compile source code directly into machine code (native code). The resulting binary is executed directly by the CPU, yielding the highest possible performance.
Source code (.c) β Compiler β Machine code (.exe) β CPU executes directlyThe upside is top-tier performance, but the binary is platform-specific β you need to recompile for each OS and architecture.
Pure Interpretation
Early BASIC and shell scripts work by reading and executing source code line by line.
Source code β Interpreter reads and executes one line at a timeThis is platform-independent and convenient, but parsing source code on every execution makes it the slowest approach.
Bytecode VM
This is the model used by Python, Java, C#, and many others. Source code is first compiled to an intermediate representation (bytecode), which is then executed by a virtual machine (VM).
Source code β Compiler β Bytecode β VM executesThis approach offers several advantages:
- Portability: Bytecode is platform-independent. It runs anywhere the VM exists (Java's "Write once, run anywhere")
- Faster than pure interpretation: Source code is parsed only once. Bytecode is a compact, efficient format
- Room for optimization: A JIT compiler can achieve near-native speed at runtime
The Compilation Pipeline β From Source Code to Bytecode
Let's trace the bytecode generation process, using Python as our example.
Lexical Analysis (Lexing / Tokenization)
The source code string is converted into a sequence of tokens (lexemes).
# Source code
x = 1 + 2 * 3Token sequence:
NAME("x") β EQUAL("=") β NUMBER("1") β PLUS("+")
β NUMBER("2") β STAR("*") β NUMBER("3") β NEWLINEThe lexer skips whitespace and comments while splitting the source into the smallest meaningful units.
Parsing
The token sequence is transformed into an Abstract Syntax Tree (AST) β a tree structure representing the program's structure.
Assign
/ \
Name BinOp(+)
"x" / \
Num(1) BinOp(*)
/ \
Num(2) Num(3)The parser follows the language's grammar rules, correctly reflecting operator precedence (* before +) in the tree structure.
Semantic Analysis
Additional checks and information are attached to the AST:
- Variable scope resolution (local, global, or closure capture)
- Type checking (for statically typed languages)
- Simple optimizations like constant folding
In Python, symbol table construction happens at this stage, determining which scope each variable belongs to.
Code Generation
The AST (or its intermediate representation) is traversed to generate bytecode instructions.
# Examining actual bytecode in Python
import dis
def example():
x = 1 + 2 * 3
dis.dis(example) 2 RESUME 0
3 LOAD_CONST 1 (7)
STORE_FAST 0 (x)
RETURN_CONST 0 (None)Several things are worth noting:
- Constant folding:
1 + 2 * 3has been precomputed to7. CPython's compiler evaluates constant expressions at the AST stage - LOAD_CONST / STORE_FAST: The bytecode is based on a stack machine model (more on this later)
- RETURN_CONST: Python functions implicitly return
Nonewhen there's no explicitreturn
Let's look at a slightly more complex example:
def loop_example():
total = 0
for i in range(10):
total += i
return total
dis.dis(loop_example) 2 RESUME 0
3 LOAD_CONST 1 (0)
STORE_FAST 0 (total)
4 LOAD_GLOBAL 1 (range)
LOAD_CONST 2 (10)
CALL 1
GET_ITER
>> FOR_ITER 7 (to 36)
STORE_FAST 1 (i)
5 LOAD_FAST 0 (total)
LOAD_FAST 1 (i)
BINARY_OP 13 (+=)
STORE_FAST 0 (total)
JUMP_BACKWARD 9 (to 18)
>> END_FOR
POP_TOP
6 LOAD_FAST 0 (total)
RETURN_VALUEThe for loop is translated into a GET_ITER β FOR_ITER β JUMP_BACKWARD bytecode pattern. This is the actual instruction sequence the VM executes.
The interactive demo below lets you step through bytecode execution one instruction at a time. Observe how the instruction pointer (IP), operand stack, and local variables change at each step.
Bytecode Execution Simulator
Step through bytecode execution of a function that starts with total=0, i=1 and repeats total += i; i += 1 while i < 3.
Bytecode Structure
Bytecode is essentially a byte sequence of instruction (opcode) and operand pairs.
Opcodes and Operands
Bytecode instruction:
[opcode] [operand]
1 byte 0 to several bytes
Example (CPython 3.12+, base instruction = 1 word = 2 bytes; some have inline cache entries):
LOAD_CONST 1 β Push value at index 1 of constant table onto stack
STORE_FAST 0 β Store stack top into local variable slot 0
BINARY_OP 0 β Pop 2 values from stack, add, push resultSince CPython 3.6, all instructions use a word code format (2 bytes = 1 byte opcode + 1 byte operand). CPython 3.12 and later extended some instructions to 4 bytes. When an operand exceeds 255, an EXTENDED_ARG instruction is prepended to extend the operand.
Code Objects
In Python, compiled bytecode is stored as a code object.
def add(a, b):
return a + b
code = add.__code__
print(code.co_code) # Raw byte sequence
print(code.co_consts) # Constant table: (None,)
print(code.co_varnames) # Local variable names: ('a', 'b')
print(code.co_stacksize) # Required stack depth: 2
print(code.co_nlocals) # Number of local variables: 2Code objects contain not just the instruction sequence but all metadata the VM needs: constant tables, variable name tables, stack size information, and more.
.pyc Files β Bytecode Caching
Python caches compiled bytecode as .pyc files when importing modules.
__pycache__/
module.cpython-312.pyc.pyc file format:
| Offset | Size | Content |
|---|---|---|
| 0 | 4 bytes | Magic number (identifies Python version) |
| 4 | 4 bytes | Bit field (invalidation mode flags) |
| 8 | 8 bytes | Source file timestamp or hash |
| 16 | onward | Marshalled code object |
On subsequent imports, if the .pyc is up to date, Python skips recompilation and loads the bytecode directly, reducing startup time.
Stack Machines vs Register Machines
VM architectures fall into two major categories: stack machines and register machines.
Stack Machines
This is the model used by CPython, JVM, and CLR. All operations go through an operand stack.
Expression: a + b * c bytecode (stack machine)
Instruction Stack state (left = bottom)
ββββββββββββββββββββββββββββββββββββββββ
LOAD a [a]
LOAD b [a, b]
LOAD c [a, b, c]
MUL [a, (b*c)]
ADD [(a+b*c)]Pros:
- Compact instructions (no need to specify register numbers in operands)
- Easy code generation (direct post-order AST traversal)
- Simple instruction set makes VM implementation straightforward
Cons:
- More instructions needed for the same computation (frequent LOAD/STORE)
- Stack reads/writes become memory accesses with associated overhead
Register Machines
This model is used by Lua (5.0+), Dalvik VM (Android), and some JS engines. Operations use an array of virtual registers.
Expression: a + b * c bytecode (register machine)
Instruction Register state
ββββββββββββββββββββββββββββββββββββββββ
MUL R1, R_b, R_c R1 = b * c
ADD R0, R_a, R1 R0 = a + b * cPros:
- Fewer instructions (one instruction can specify multiple operands)
- Less data movement, reducing interpreter dispatch overhead
Cons:
- Larger instructions (register numbers require extra bits in operands)
- More complex code generation (register allocation needed)
Real-World VM Choices
| VM | Architecture | Language |
|---|---|---|
| CPython | Stack machine | Python |
| JVM | Stack machine | Java, Kotlin, Scala, Clojure |
| CLR | Stack machine | C#, F#, VB.NET |
| Lua VM | Register machine | Lua |
| Dalvik | Register machine | Java (Android, replaced by ART in Android 5.0) |
| BEAM | Register machine (hybrid) | Erlang, Elixir |
Research has shown that register machines can reduce instruction dispatch count by 47% compared to stack machines (Yunhe Shi et al., "Virtual Machine Showdown: Stack Versus Registers", VEE 2005 / ACM TACO 2008). However, bytecode size is roughly 26% larger for register machines.
The demo below executes the same expression a + b Γ c on a stack machine and register machine simultaneously, so you can feel the difference in instruction count.
Stack Machine vs Register Machine
Compare stack and register machine execution side by side, computing a + b Γ c.
Interpreter Implementation Patterns
The interpreter β the core of the VM that executes bytecode β can be implemented using several patterns.
Switch-based Interpreter
The simplest implementation. A giant switch statement handles each opcode.
// Simplified VM loop (C language)
void eval(uint8_t *bytecode, Value *stack) {
uint8_t *ip = bytecode; // Instruction Pointer
Value *sp = stack; // Stack Pointer
for (;;) {
uint8_t opcode = *ip++;
switch (opcode) {
case OP_LOAD_CONST: {
uint8_t idx = *ip++;
*sp++ = constants[idx];
break;
}
case OP_ADD: {
Value b = *--sp;
Value a = *--sp;
*sp++ = value_add(a, b);
break;
}
case OP_STORE_FAST: {
uint8_t slot = *ip++;
locals[slot] = *--sp;
break;
}
case OP_RETURN: {
return;
}
// ... hundreds of opcodes
}
}
}Pros: Simple to implement and understand. The historical core of CPython.
Cons: Poor CPU branch prediction. All opcodes share the same switch branch point, so the branch predictor can't predict "which opcode comes next," causing frequent pipeline stalls.
Direct Threaded Interpreter
Uses GCC's computed goto extension to jump directly to each opcode handler's address.
void eval_threaded(uint8_t *bytecode, Value *stack) {
// Address table for each handler
static void *dispatch_table[] = {
[OP_LOAD_CONST] = &&op_load_const,
[OP_ADD] = &&op_add,
[OP_STORE_FAST] = &&op_store_fast,
[OP_RETURN] = &&op_return,
// ...
};
uint8_t *ip = bytecode;
Value *sp = stack;
#define DISPATCH() goto *dispatch_table[*ip++]
DISPATCH(); // Jump to first instruction
op_load_const: {
uint8_t idx = *ip++;
*sp++ = constants[idx];
DISPATCH();
}
op_add: {
Value b = *--sp;
Value a = *--sp;
*sp++ = value_add(a, b);
DISPATCH();
}
op_store_fast: {
uint8_t slot = *ip++;
locals[slot] = *--sp;
DISPATCH();
}
op_return:
return;
}Pros: Since each handler ends with a different indirect jump, the CPU branch predictor can maintain independent entries per handler. Reported 15-25% speedup over switch-based implementations.
Cons: Depends on GCC/Clang extensions (not standard C). Not supported by MSVC.
CPython 3.x detects at compile time whether computed goto is available and uses direct threading when it is.
Token Threaded / Subroutine Threaded
Beyond direct threading, there are other threading approaches:
- Token Threaded: Keeps bytecode as opcode indices and looks up the dispatch table at dispatch time. Produces more compact bytecode than direct threading
- Subroutine Threaded: Implements each opcode handler as a function and converts the bytecode sequence into an array of function pointers. Uses
call/retinstructions, which benefit from the CPU's Return Stack Buffer (RSB) for branch prediction
Specializing Adaptive Interpreter
A groundbreaking optimization introduced in CPython 3.11. It specializes bytecode instructions based on runtime data type patterns.
Generic instruction: BINARY_OP ADD
β Runtime detects "int + int" pattern
Specialized instruction: BINARY_OP_ADD_INTHow it works:
- Each instruction gets an adaptive counter
- After an instruction executes a certain number of times, the VM analyzes recent operand type patterns
- If the pattern is stable, the generic instruction is rewritten to a more efficient specialized instruction
- If an unexpected type appears, deoptimization reverts to the generic instruction
Key specialized instructions (CPython 3.12+):
| Generic Instruction | Specialized Instruction | Condition |
|---|---|---|
LOAD_ATTR | LOAD_ATTR_INSTANCE_VALUE | Instance attribute dict access |
LOAD_ATTR | LOAD_ATTR_SLOT | Class using __slots__ |
LOAD_ATTR | LOAD_ATTR_MODULE | Module attribute |
BINARY_OP | BINARY_OP_ADD_INT | Both operands are int |
BINARY_OP | BINARY_OP_ADD_FLOAT | Both operands are float |
BINARY_OP | BINARY_OP_ADD_UNICODE | Both operands are str |
COMPARE_OP | COMPARE_OP_INT | Integer comparison |
CALL | CALL_PY_EXACT_ARGS | Python function call (exact arg count) |
FOR_ITER | FOR_ITER_LIST | List iteration |
UNPACK_SEQUENCE | UNPACK_SEQUENCE_TWO_TUPLE | 2-element tuple unpack |
This optimization made CPython 3.11 on average 25% faster than 3.10. Achieving this level of improvement through interpreter-only techniques β without JIT β was remarkable.
Frames and the Call Stack
Let's examine how VMs manage function calls.
Frame Objects
Each time a function is called, the VM creates a frame (stack frame / execution frame).
βββββββββββββββββββββββββββββββββββ
β Frame β
βββββββββββββββββββββββββββββββββββ€
β Code object (code) β β Bytecode instructions
β Instruction pointer (IP) β β Current execution position
β Local variable array (locals) β β [a, b, result, ...]
β Operand stack β β Temporary stack for operations
β Link to previous frame β β Caller's frame
β Exception handler stack β β try/except info
β Builtins namespace reference β β builtins
β Global namespace reference β β globals
βββββββββββββββββββββββββββββββββββCall Stack Growth
def outer():
x = 10
return inner(x)
def inner(n):
return n * 2Runtime call stack:
inner(10) frame β Currently executing
code: inner.__code__
locals: [n=10]
stack: []
ββββββββββββββββββ
outer() frame
code: outer.__code__
locals: [x=10]
stack: []
ββββββββββββββββββ
<module> frame
code: <module>.__code__
locals: []
stack: []When inner executes return, its frame is discarded and control returns to outer's frame.
CPython limits the frame stack depth via sys.getrecursionlimit() (default 1000). This is a safety valve to prevent segmentation faults from stack overflow.
Frame Optimizations
Frame creation and destruction are overhead, so VMs apply various optimizations:
- CPython 3.11+: Inlined frames (
_PyInterpreterFrame). Avoids heap allocation by placing frames directly on the C stack - JVM: Frames are placed on the native stack. The JIT compiler may completely eliminate function calls through inlining
- V8: The JIT compiler merges frames via inlining and optimizes stack frame layout for deoptimization metadata
Object Representation and Tagged Values
How does a VM internally represent values in a dynamically typed language? This directly affects VM performance.
Boxing
In CPython, every value is represented as a pointer to a PyObject struct. Even the integer 42 is a heap-allocated object.
// CPython's PyObject struct (simplified)
typedef struct {
Py_ssize_t ob_refcnt; // Reference count
PyTypeObject *ob_type; // Pointer to type
} PyObject;
// Python integers are arbitrary precision, so they use a variable-length digit array
typedef struct {
PyObject_VAR_HEAD // ob_refcnt + ob_type + ob_size
digit ob_digit[1]; // Arbitrary-precision digit array (30 bits/digit)
} PyLongObject;
// Note: CPython 3.12+ introduced a "compact" representation for small ints,
// storing the value in ob_size bit fields and omitting ob_digitPython's int is an arbitrary-precision integer that can represent values of any size. Even a small integer like 42 requires: reference count (8 bytes) + type pointer (8 bytes) + size (8 bytes) + digit array + memory allocator overhead = at least 28 bytes. Compare this to C's int at just 4 bytes β quite a luxurious use of memory.
NaN Boxing
Some VMs exploit the IEEE 754 NaN (Not a Number) bit pattern to store both a type tag and value within a 64-bit double.
IEEE 754 double (64 bits):
sign(1) | exponent(11) | mantissa(52)
NaN bit pattern:
x | 11111111111 | 1xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
β quiet NaN bit
Value storage:
- Floating point: stored directly as IEEE 754 bit pattern
- Integers, pointers, booleans: stored in NaN mantissa with tag bits + valueNaN Boxing layout example:
[0][11111111111][1][TTT][PPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPP]
β tag (3 bits) β payload (48 bits)
Tags:
000 = pointer (object)
001 = 32-bit integer
010 = boolean
011 = nil/null
...LuaJIT, JavaScriptCore (Safari), and SpiderMonkey (Firefox) use this technique. In SpiderMonkey, NaN boxing is the primary value representation used throughout the engine. Since values fit in a 64-bit word on the stack, heap allocation and pointer indirection are eliminated, yielding significant performance gains.
Tagged Pointers
On 64-bit architectures, alignment ensures that the lower few bits of pointers are always zero. These spare bits can be used as type tags.
Normal pointer (8-byte alignment):
PPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPP000
β always 0 (lower 3 bits)
Tagged pointer:
PPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPP|TTT|
β pointer portion (upper 61 bits) β tag (lower 3 bits)V8 (Google's JavaScript engine) uses tagged pointers to distinguish Smi (Small Integer) from HeapObject:
- Least significant bit is
0β Smi (31-bit integer stored in upper 31 bits) - Least significant bit is
1β HeapObject pointer (mask off tag bit to dereference)
CRuby (MRI) uses a similar technique, representing Fixnum (small integers) without heap allocation.
JIT Compilers β Runtime Native Code Generation
This is the heart of the article. A JIT (Just-In-Time) compiler compiles bytecode to native machine code during program execution, dramatically speeding up subsequent execution by running native code directly.
Why JIT Is Needed
When interpreting bytecode, dozens to hundreds of native instructions execute for each bytecode instruction.
Bytecode: BINARY_OP_ADD_INT
What the interpreter actually does (outline):
1. Fetch opcode (memory read)
2. Get handler address from dispatch table
3. Jump to handler
4. Pop operand B from stack
5. Pop operand A from stack
6. Type check A (is it int?)
7. Type check B (is it int?)
8. Overflow check
9. Perform addition
10. Push result onto stack
11. Advance to next opcode
When JIT compiles to native code:
add rax, rbx β Just 1 instruction (or a few)For integer addition, the interpreter incurs tens of times more overhead. JIT eliminates this overhead.
JIT Compilation Flow
Source code
β
Compile to bytecode
β
Start executing in interpreter
β
Profiling (collect type info, execution frequency)
β
Hot spot detection (frequently executed code)
β
JIT compile (bytecode β IR β optimization β native code)
β
Execute native code (bypass interpreter on subsequent runs)
β
If assumptions break, deoptimize (native code β back to interpreter)Method JIT vs Tracing JIT
JIT compilers fall into two major approaches.
Method JIT
Compiles at the function (method) level. JVM (HotSpot C2), V8, and .NET's RyuJIT use this approach.
Function calculate() was called 1000 times
β Identified as "hot method"
β Compile entire function to native code
β Execute native code on subsequent callsPros: Clear compilation units make optimization straightforward. Pairs well with inlining. Cons: Compiling large functions takes time. Even if only part of a function is hot, the entire function is compiled.
Tracing JIT
Traces (records) execution paths and compiles the actual executed bytecode sequence. LuaJIT and the former Mozilla TraceMonkey used this approach.
Record loop execution path:
LOAD x β LOAD y β ADD β COMPARE β BRANCH(true) β STORE β JUMP_BACK
Compile this "trace" to native code:
- Assume branch is always true (verified by guard instructions)
- Assume types match recorded types (verified by guard instructions)
- Fall back to interpreter if a guard failsPros: Extremely effective for loop optimization. Only the actually executed path is compiled, eliminating dead code. Cross-function inlining happens naturally. Cons: Weak against complex branching patterns (frequent deoptimization from guard failures). Trace recording has overhead.
Type Specialization
The most important JIT optimization for dynamically typed languages is type specialization.
def add(a, b):
return a + bThis function can accept int, float, str, list, or any other type. But if profiling reveals "this function is always called with int," the JIT can specialize:
Generic version (interpreter):
1. Get type of a
2. Get type of b
3. Look up __add__ method for the type
4. Call __add__
5. Box result and return
Specialized version (JIT, assuming int):
; Guard: verify a is int
test [rdi + type_offset], INT_TAG
jne deoptimize
; Guard: verify b is int
test [rsi + type_offset], INT_TAG
jne deoptimize
; Direct addition
mov rax, [rdi + value_offset]
add rax, [rsi + value_offset]
; Overflow check
jo deoptimize
retType checking overhead is minimized through guard instructions. If the guard succeeds, the optimized path runs; if it fails, deoptimization returns to the interpreter.
Inlining
One of the most powerful JIT compiler optimizations. It replaces function calls with the called function's body.
def square(x):
return x * x
def sum_of_squares(a, b):
return square(a) + square(b)The JIT compiler transforms this into:
sum_of_squares native code (after inlining):
; Inline square(a)
mov rax, [rdi] ; value of a
imul rax, rax ; a * a
mov rcx, rax ; save result
; Inline square(b)
mov rax, [rsi] ; value of b
imul rax, rax ; b * b
; Addition
add rax, rcx ; a*a + b*b
retBenefits of inlining:
- Function call/return overhead is eliminated
- Further optimizations (constant propagation, dead code elimination) can be applied to the inlined code
- Optimizations can use the caller's context (type information)
Deoptimization
When assumptions made by the JIT are violated during execution, native code execution must be suspended and control returned to the interpreter. This is deoptimization.
def process(x):
return x + 1
# First 10,000 calls with int β JIT generates int version
for i in range(10000):
process(i)
# Suddenly called with float β guard fails, deoptimization occurs
process(3.14)The deoptimization process:
- Guard instruction detects failure
- Capture current machine register state
- Reconstruct interpreter frame (stack, local variables, instruction pointer)
- Return control to interpreter
- Update profile information and set recompilation trigger
Deoptimization is very expensive, but if it's rare enough, overall JIT performance still improves. The problem is when deoptimization happens frequently β the compile β deoptimize β recompile cycle can make things slower.
On-Stack Replacement (OSR)
Normally, JIT compilation takes effect on a function's next invocation. But for long-running loops, you can't wait until the function is called again.
OSR is a technique that replaces the currently executing interpreter frame with JIT-compiled native code mid-execution.
Running loop:
for i in range(1000000): β During loop iteration...
...
β
JIT compilation complete!
β
OSR: Switch from interpreter to native code
β
Remaining loop iterations run in native codeWhat OSR requires:
- Capture interpreter frame state (local variables, stack, loop counter)
- Map to native code frame layout
- Jump to the appropriate position in the native code loop body
JVM's HotSpot, V8, and SpiderMonkey all support OSR.
Multi-Tier JIT Compilation
JIT compilation involves a tradeoff between compilation time and optimization level. Compile quickly with low optimization, or spend time for high optimization β it's a dilemma.
Most practical JITs solve this with tiered compilation.
The interactive demo below shows how functions progress from interpreter to baseline JIT to optimizing JIT as they execute. You can also experience deoptimization and recompilation.
Multi-Tier JIT Compilation in Action
Watch 3 functions progress from interpreter β baseline JIT β optimizing JIT as they execute. Experience deoptimization and recompilation.
JVM HotSpot's Tiered Compilation
JVM's HotSpot has 5 tiers, but methods don't necessarily pass through all of them sequentially. The most common path is Tier 0 β Tier 3 β Tier 4 (interpreter β C1 with full profiling β C2):
Tier 0: Interpreter (collects profiling data)
β Method call count exceeds threshold (this is typically the first compilation)
Tier 3: C1, full profiling
β Sufficient profile data accumulated
Tier 4: C2 (Server compiler), aggressive optimization
Tiers used in special cases:
Tier 1: C1, no profiling (for trivial methods)
Tier 2: C1, limited profiling (fallback when C2 queue is full)| Tier | Compiler | Compile Speed | Optimization Level | Profiling |
|---|---|---|---|---|
| 0 | None (interpreter) | - | None | Yes |
| 1 | C1 | Fast | Basic | No |
| 2 | C1 | Fast | Basic | Limited |
| 3 | C1 | Fast | Basic | Full |
| 4 | C2 | Slow | Aggressive | - |
C1 (Client compiler) performs quick, basic optimizations for applications where short startup time matters. C2 (Server compiler) spends more time on aggressive optimizations for long-running server applications.
Key C2 optimizations:
- Loop unrolling
- Escape analysis for allocation elimination
- Lock elision
- Bounds check elimination
- Dead code elimination
- Vectorization (SIMD instructions)
V8's Tiered Compilation
Google's V8 JavaScript engine also uses tiered compilation.
Ignition (interpreter)
β Function becomes hot
Sparkplug (baseline compiler, V8 v9.1+)
β Further profiling
Maglev (mid-tier compiler, Chrome 117+ / V8 v11.7+)
β Sufficient profile data
TurboFan (optimizing compiler)- Ignition: Register-based bytecode interpreter. All code starts here
- Sparkplug: Non-optimizing compiler that generates native code directly from bytecode. Extremely fast compilation (bypasses AST)
- Maglev (Chrome 117+ / V8 v11.7+): Mid-tier compiler using SSA-based intermediate representation with type-feedback-based optimization
- TurboFan: Full optimizing compiler using Sea of Nodes IR. Performs inlining, loop optimization, escape analysis, and more
GraalVM's Graal JIT
Oracle's GraalVM ships with a JIT compiler (Graal) that is implemented in Java itself.
Notable features:
- Partial Escape Analysis: When an object escapes only on certain paths, allocation happens only on those paths β on other paths, it's kept as scalar values
- Polyglot support (Truffle framework): Write an AST interpreter using Truffle's API (
@Specialization,@NodeChild, etc.), and Graal JIT automatically applies optimization. Supports Python, Ruby, JavaScript, R, LLVM bitcode, and more - Native Image: AOT (Ahead-Of-Time) compilation produces standalone binaries with near-instant startup time (on the order of milliseconds)
Major VM Comparison
Let's compare today's major bytecode VMs.
CPython (Python)
Features:
- Stack-based bytecode VM
- GIL (Global Interpreter Lock) limits multi-threading
- Reference counting + cycle detection GC
- CPython 3.11: Specializing Adaptive Interpreter
- CPython 3.13: Experimental JIT (copy-and-patch JIT)
- CPython 3.13: Free-threaded mode (GIL disabled, experimental)CPython 3.13's copy-and-patch JIT is an ultra-lightweight JIT that copies pre-prepared native code templates (stencils) and patches operands. It minimizes compilation time while eliminating interpreter dispatch overhead.
Importantly, the copy-and-patch JIT does not compile bytecodes directly. Instead, the Tier 2 optimizer first converts bytecodes to a micro-operation (ΞΌop) intermediate representation, and the JIT is applied to that ΞΌop sequence.
Copy-and-patch flow:
1. At build time, generate native code templates (stencils) for each ΞΌop
2. At runtime, the Tier 2 optimizer converts bytecodes to ΞΌop sequences
3. Copy templates for hot ΞΌop sequences into memory
4. Patch operands (constant values, variable slot numbers)
5. Concatenate templates to form complete native code sequence
6. Execute the concatenated template sequence directlyJVM (Java, Kotlin)
Features:
- Stack-based bytecode VM
- Tiered JIT (C1 + C2, or Graal)
- Advanced GC (G1, ZGC, Shenandoah)
- Bytecode verification for security guarantees
- Dynamic class loading via ClassLoadersThe JVM bytecode verifier checks before execution:
- No stack overflow/underflow
- Local variable access is within bounds
- Type consistency is maintained
finallyblocks are properly connected
This prevents execution of malformed bytecode and guarantees VM memory safety.
CLR (C#, F#)
Features:
- Stack-based CIL (Common Intermediate Language)
- RyuJIT (JIT compiler)
- Value types (struct) β avoids boxing
- Span<T> / stackalloc for zero-allocation
- NativeAOT for ahead-of-time compilation
- Generational mark-and-sweep GCCLR's standout feature is value type support. struct values are placed directly on the stack, significantly reducing heap allocation and GC pressure.
// Value type β placed on stack
struct Point { public int X, Y; }
// Reference type β heap allocated
class PointClass { public int X, Y; }
void Example() {
Point p = new(1, 2); // On the stack. No GC needed
PointClass pc = new(1, 2); // Heap allocated. GC managed
}LuaJIT
Features:
- Register-based bytecode VM
- Tracing JIT
- NaN Boxing for value representation
- Extremely fast (among the fastest dynamic language VMs)
- Fast FFI (Foreign Function Interface)LuaJIT, developed by Mike Pall, is one of the fastest tracing JITs in existence. Despite its compact codebase (~100K lines of C/assembly), it achieves near-C performance on many benchmarks.
How LuaJIT's tracing JIT works:
- Hot loop detection: A loop's back-edge execution count exceeding a threshold marks it as "hot"
- Trace recording: Records the hot loop's execution path as a "trace." Function calls are unfolded into the trace
- SSA conversion: Converts the trace to SSA (Static Single Assignment) form IR
- Optimization: Constant folding, algebraic simplification, CSE (Common Subexpression Elimination), DCE (Dead Code Elimination), loop-invariant code motion, register allocation
- Code generation: Generates native code directly from optimized IR (supports x86, x64, ARM, MIPS, PPC)
V8 (JavaScript)
Features:
- Bytecode (Ignition) + tiered JIT
- Hidden Classes for efficient object representation
- Inline Caches (IC) for fast property access
- Generational GC (Scavenger + Mark-Compact)
- Tagged pointersV8's Hidden Class (Map) system infers fixed layouts β similar to C++ classes β for dynamically typed objects.
// These two objects share the same Hidden Class
let a = { x: 1, y: 2 };
let b = { x: 3, y: 4 };
// This triggers a transition to a new Hidden Class
a.z = 5;Objects with properties added in the same order share the same Hidden Class, allowing the JIT compiler to optimize property access based on it.
Concrete Example: Python Bytecode Execution Trace
Finally, let's trace exactly how Python bytecode is executed, step by step.
def fibonacci(n):
if n <= 1:
return n
return fibonacci(n - 1) + fibonacci(n - 2)import dis
dis.dis(fibonacci) 2 RESUME 0
3 LOAD_FAST 0 (n)
LOAD_CONST 1 (1)
COMPARE_OP 26 (<=)
POP_JUMP_IF_FALSE 1 (to 14)
4 LOAD_FAST 0 (n)
RETURN_VALUE
5 >> LOAD_GLOBAL 1 (fibonacci)
LOAD_FAST 0 (n)
LOAD_CONST 1 (1)
BINARY_OP 10 (-)
CALL 1
LOAD_GLOBAL 1 (fibonacci)
LOAD_FAST 0 (n)
LOAD_CONST 2 (2)
BINARY_OP 10 (-)
CALL 1
BINARY_OP 0 (+)
RETURN_VALUELet's trace the execution of fibonacci(3), showing the stack changes:
fibonacci(3) execution trace:
Instruction Stack Notes
βββββββββββββββββββββββββββββββββββββββββββββββββββββ
RESUME [] Frame initialization
LOAD_FAST 0 (n) [3] Push n=3
LOAD_CONST 1 (1) [3, 1] Push constant 1
COMPARE_OP <= (26) [False] 3 <= 1 β False
POP_JUMP_IF_FALSE [] False, so jump
LOAD_GLOBAL 1 [NULL, fibonacci] Push NULL + function object
LOAD_FAST 0 (n) [NULL, fibonacci, 3] n=3
LOAD_CONST 1 (1) [NULL, fibonacci, 3, 1] Constant 1
BINARY_OP - (10) [NULL, fibonacci, 2] 3 - 1 = 2
CALL 1 [β calls fibonacci(2)]
... recursive call ...
[result_1] fibonacci(2) result = 1
LOAD_GLOBAL 1 [1, NULL, fibonacci]
LOAD_FAST 0 (n) [1, NULL, fibonacci, 3]
LOAD_CONST 2 (2) [1, NULL, fibonacci, 3, 2]
BINARY_OP - (10) [1, NULL, fibonacci, 1] 3 - 2 = 1
CALL 1 [β calls fibonacci(1)]
... recursive call ...
[1, result_2] fibonacci(1) result = 1
BINARY_OP + (0) [2] 1 + 1 = 2
RETURN_VALUE β returns 2Inline Caching β Speeding Up Dynamic Dispatch
In dynamically typed languages, every property access or method call requires checking the object's type and looking up the property's location. Inline Caching (IC) caches these lookup results for reuse.
IC State Transitions
Uninitialized
β First access
Monomorphic
β Only one type observed. Highest cache hit rate
β Different type observed
Polymorphic
β 2-4 types observed. Linear search through multiple cache entries
β More types observed
Megamorphic
β Too many types observed. Cache abandoned, falls back to generic pathConcrete Example
function getX(obj) {
return obj.x; // β IC is set up at this call site
}
// Always called with same-shape objects β Monomorphic (fastest)
getX({ x: 1, y: 2 });
getX({ x: 3, y: 4 });
// Mixed object shapes β Polymorphic β Megamorphic
getX({ x: 1 });
getX({ x: 1, y: 2, z: 3 });
getX({ a: 0, x: 1 });A monomorphic IC can speed up property access to one comparison + one memory read. JIT compilers use monomorphic IC information to generate direct field access code with type guards.
Profiling Mechanics
For a JIT compiler to decide "which code to compile" and "which types to specialize for," profiling data collection is essential.
Profiling Approaches Compared
| VM | Profiling Method | Mechanism |
|---|---|---|
| JVM HotSpot | Invocation counter + backedge counter | Counts method invocations and loop backedge executions. Decays over time to prioritize recent hot spots |
| V8 | Interrupt Budget | Decrements a budget per bytecode instruction. Triggers compilation when budget reaches 0 |
| CPython 3.11+ | Per-instruction adaptive counter | Attaches a counter to each bytecode instruction. Specialization is attempted when threshold is reached |
HotSpot's counter decay is an important design decision. It prevents code that ran frequently during startup β but rarely during steady state β from being considered "hot" indefinitely.
GCβJIT Coordination β Safepoints
JIT-compiled native code must cooperate with the garbage collector. For the GC to scan and relocate heap objects, all threads must stop at a safe state (safepoint).
Safepoint Placement
JIT compilers insert safepoint polls into generated native code.
JIT-generated code (outline):
...
add rax, rbx ; computation
test [safepoint_page], eax ; β
safepoint poll
... ; when safepoint_page is made unreadable,
; a page fault occurs β control transfers to GCIn JVM HotSpot, safepoints are placed at:
- Just before method
return - Loop backedges (jumps from end to start of loop)
- JNI / native method call transitions
For counted loops (loops with an int counter and a known bound), backedge safepoints may be omitted, which can cause unexpectedly long GC pauses (Time-To-Safepoint issues). Since JDK 17, loop strip mining is enabled by default, breaking counted loops into chunks with safepoint polls between them to mitigate this problem.
JIT Security β The W^X Problem
JIT compilation poses an inherent security challenge. Generating native code at runtime requires memory to be both writable (W) and executable (X), violating the W^X (Write XOR Execute) security policy.
Addressing the W^X Problem
| Approach | Description | Used By |
|---|---|---|
| W/X toggle | Memory is W during code generation, then switched to X | Most JITs |
| Double mapping | Map the same physical page as W (for writing) and X (for execution) at two virtual addresses | OpenJDK, V8 |
| JIT prohibition | No JIT β use only AOT or interpreter | iOS apps (except Safari) |
On iOS, Apple prohibits JIT for third-party apps. Safari and WKWebView (including third-party browsers like Chrome and Firefox) have JIT enabled, but apps that embed JavaScriptCore directly run with only the interpreter (LLInt) and no JIT tiers. This can impact JavaScript performance on iOS.
Conclusion
Bytecode VMs are an elegant architecture that balances portability with execution efficiency.
Source Code
β Lexing β Parsing β Semantic Analysis β Code Generation
Bytecode (portable intermediate representation)
β VM executes
Interpreter (executes all code first)
β Profiling
JIT Compiler (converts hot spots to native code)
β Type specialization β Inlining β Various optimizations
Native Code (CPU executes directly)Each VM makes different tradeoffs:
- CPython: Prioritizes simplicity and ecosystem stability. JIT only recently introduced
- JVM: A heavyweight JIT pipeline optimized for long-running servers
- V8: Multi-tier compilation balancing short script startup with server-side optimization
- LuaJIT: A tracing JIT that extracts remarkable performance from a compact codebase
We may never reach the ideal VM β zero overhead, infinitely fast β but through interpreter improvements, JIT compiler evolution, and hardware co-design, the gap keeps narrowing. CPython's rapid acceleration in the 2020s (3.11's Specializing Adaptive Interpreter, 3.13's copy-and-patch JIT) demonstrates that bytecode VM technology is still very much evolving. Meanwhile, WebAssembly (WASM) is bringing bytecode VM design principles beyond the browser into server-side and edge computing.
References
- Aho, Lam, Sethi, Ullman. Compilers: Principles, Techniques, and Tools (2nd Edition), 2006
- Smith & Nair. Virtual Machines: Versatile Platforms for Systems and Processes, 2005
- Yunhe Shi, David Gregg, Andrew Beatty, M. Anton Ertl. "Virtual Machine Showdown: Stack Versus Registers", ACM Transactions on Architecture and Code Optimization, 2008
- CPython Developer's Guide. https://devguide.python.org/internals/
- Mark Shannon. "Faster CPython" (PEP 659 β Specializing Adaptive Interpreter)
- Mike Pall. "LuaJIT 2.0 intellectual property disclosure and research opportunities", 2009
- "Life of a V8 JavaScript object", V8 Blog. https://v8.dev/blog
- WΓΌrthinger et al. "One VM to Rule Them All" (Truffle/Graal), ACM DL, 2013
- Bryce Adelstein Lelbach. "A Deep Introduction to JIT Compilers: JITs are not very Just-in-Time", CppCon 2021
- HΓΆlzle, Chambers, Ungar. "Optimizing Dynamically-Typed Object-Oriented Languages With Polymorphic Inline Caches", ECOOP, 1991
- Agesen. "Design and Implementation of PEP, a Java Just-In-Time Translator", Sun Microsystems, 1997