Bytecode compilation is the process of translating high-level source code into an intermediate representation called bytecode - a compact, platform-independent instruction format designed for execution by a virtual machine.
What is Bytecode?
Bytecode sits between source code and machine code:
graph LR
    A[Source Code] -->|Compile| B[Bytecode]
    B -->|Interpret/JIT| C[Machine Code]
    C -->|Execute| D[CPU]
    style A fill:#e3f2fd
    style B fill:#fff9c4
    style C fill:#f1f8e9
    style D fill:#fce4ec
Characteristics:
- More abstract than machine code
- More concrete than source code
- Platform-independent
- Optimized for VM execution
- Usually not human-readable (binary format)
Compilation Pipeline
The journey from Ruby to YARV bytecode:
┌─────────────────────┐
│     Ruby Source     │
│      x = 2 + 3      │
└──────────┬──────────┘
           ↓
   [Lexical Analysis]
           ↓
┌─────────────────────┐
│    Token Stream     │
│   [x, =, 2, +, 3]   │
└──────────┬──────────┘
           ↓
    [Syntax Analysis]
           ↓
┌─────────────────────┐
│ Abstract Syntax Tree│
│        (=)          │
│       /   \         │
│      x    (+)       │
│          /   \      │
│         2     3     │
└──────────┬──────────┘
           ↓
    [Code Generation]
           ↓
┌─────────────────────┐
│    YARV Bytecode    │
│    putobject 2      │
│    putobject 3      │
│    opt_plus         │
│    setlocal x       │
└─────────────────────┘
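The three stages can be observed from Ruby itself. The following is a minimal sketch, assuming CRuby (the `RubyVM` constant is not available on other implementations). It uses the standard library's Ripper for the lexing and parsing stages; Ripper is a separate front end from the parser CRuby uses internally, so treat its output as an illustration of each stage rather than the compiler's exact intermediate data.

```ruby
require 'ripper'

source = "x = 2 + 3"

# Lexical analysis: each Ripper.lex entry is [[line, column], type, text, state].
# Whitespace tokens are dropped to match the token stream in the diagram.
tokens = Ripper.lex(source)
               .reject { |_pos, type, _text| type == :on_sp }
               .map { |_pos, _type, text| text }

# Syntax analysis: a nested-array S-expression rooted at :program.
ast = Ripper.sexp(source)

# Code generation: CRuby's own compiler produces the instruction sequence.
iseq = RubyVM::InstructionSequence.compile(source)

puts tokens.inspect
pp ast
puts iseq.disasm
```

The disassembly printed at the end varies between Ruby versions, so it may not match the diagram instruction for instruction.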
Example: Ruby to YARV Bytecode
Ruby source:
nil
YARV bytecode:
$ ruby --dump=insns -e 'nil'
== disasm: #<ISeq:<main>@-e:1 (1,0)-(1,3)> (catch: FALSE)
0000 putnil ( 1)[Li]
0001 leave
More complex example:
2 + 3
YARV bytecode:
0000 putobject 2
0002 putobject 3
0004 opt_plus
0006 leave
This shows how high-level operations such as + become sequences of stack instructions.
Why Bytecode?
Advantages Over Direct Interpretation
1. Performance
- Parse source code once, not every execution
- Pre-optimized instruction sequences
- Faster instruction dispatch
- Smaller memory footprint
2. Portability
- Same bytecode runs on any platform
- Platform-specific VM handles execution
- No recompilation needed
3. Optimization Opportunities
- Constant folding at compile-time
- Dead code elimination
- Peephole optimization
- JIT compilation can optimize hot paths
4. Security
- Source code can remain private
- Bytecode harder to reverse-engineer than source
- Sandboxing and verification possible
Comparison: Interpretation vs Compilation
Pure Interpretation:
Source Code → [Interpreter] → Execution
Fast startup, slow execution
Example: Early JavaScript, shell scripts
Bytecode Compilation:
Source Code → [Compiler] → Bytecode → [VM] → Execution
Moderate startup, faster execution
Example: YARV, Python, Java
Native Compilation:
Source Code → [Compiler] → Machine Code → [CPU] → Execution
Slow build step, fast startup, fastest execution
Example: C, C++, Rust, Go
Bytecode Structure
Bytecode consists of instructions and operands:
Instruction format:
┌──────────────┬─────────────┐
│    Opcode    │  Operands   │
└──────────────┴─────────────┘
 (what to do)   (data to use)

Example: putobject 42
         └───┬───┘ └┬┘
          opcode  operand
Components:
- Opcode - The operation to perform (e.g., putobject, add, jump)
- Operands - Parameters to the operation (e.g., values, addresses, offsets)
- Metadata - Line numbers, debug info, source locations
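The opcode/operand split can be inspected programmatically. This is a small sketch, assuming CRuby, that relies on `RubyVM::InstructionSequence#to_a`; the array layout is an implementation detail that may change between versions, so the code only assumes that the instruction stream is the last element.

```ruby
iseq = RubyVM::InstructionSequence.compile("x = 42")

# The last element of to_a is the instruction stream. It mixes line numbers
# (Integers), events and labels (Symbols), and instructions (Arrays).
body = iseq.to_a.last
instructions = body.grep(Array)

# Each instruction array is [opcode, operand, operand, ...].
instructions.each do |opcode, *operands|
  puts format("%-20s %s", opcode, operands.map(&:inspect).join(", "))
end
```

The Integer and Symbol entries filtered out by `grep(Array)` are the metadata component: they carry the line numbers and event markers attached to the surrounding instructions.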
YARV Bytecode Encoding
YARV uses a compact encoding:
# Ruby source
x = 42
# Bytecode (conceptual)
putobject 42 # Push 42 onto stack
setlocal x, 0 # Store in local variable x
Encoding details:
- Variable-length instructions
- Inline operands for small values
- Constant pool for complex objects
- Line number mapping for debugging
Instruction Set Design
Bytecode instructions reflect the stack-based virtual machine architecture:
Stack Operations:
putnil # Push nil
putobject <obj> # Push object
dup # Duplicate top
pop # Remove top
swap # Swap top two
Arithmetic:
opt_plus # Add top two values
opt_minus # Subtract
opt_mult # Multiply
opt_div # Divide
Control Flow:
jump <offset> # Unconditional jump
branchif <offset> # Jump if truthy
branchunless <offset> # Jump if falsy (false or nil)
Variables:
getlocal <index> # Read local variable
setlocal <index> # Write local variable
getinstancevariable # Read @variable
setinstancevariable # Write @variable
See YARV stack instructions for detailed exploration.
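To make the stack discipline concrete, here is a toy interpreter for a handful of the instructions listed above. It is a sketch under simplifying assumptions, not YARV's implementation: instructions are plain Ruby arrays, jump targets are absolute indexes rather than offsets, and locals are looked up by name instead of by index.

```ruby
# Toy stack machine. Instruction names follow the listing above, but the
# array encoding, absolute jump targets, and named locals are simplifications.
def run(program)
  stack = []
  locals = {}
  pc = 0 # index of the next instruction

  loop do
    opcode, operand = program.fetch(pc)
    pc += 1

    case opcode
    when :putnil       then stack.push(nil)
    when :putobject    then stack.push(operand)
    when :dup          then stack.push(stack.last)
    when :pop          then stack.pop
    when :setlocal     then locals[operand] = stack.pop
    when :getlocal     then stack.push(locals.fetch(operand))
    when :jump         then pc = operand
    when :branchif     then pc = operand if stack.pop
    when :branchunless then pc = operand unless stack.pop
    when :opt_plus, :opt_mult
      right = stack.pop
      left = stack.pop
      stack.push(opcode == :opt_plus ? left + right : left * right)
    when :leave
      return stack.pop
    else
      raise ArgumentError, "unknown opcode: #{opcode}"
    end
  end
end

# x = 2 + 3; x * 4
program = [
  [:putobject, 2],
  [:putobject, 3],
  [:opt_plus],
  [:setlocal, :x],
  [:getlocal, :x],
  [:putobject, 4],
  [:opt_mult],
  [:leave]
]

result = run(program)
puts result # 20
```

Every instruction either pushes values, pops values, or does both, which is why no instruction needs to name the registers it works on.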
Optimizations During Compilation
1. Constant Folding
Before:
x = 2 + 3
Naive bytecode:
putobject 2
putobject 3
opt_plus
setlocal x
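Optimized bytecode (what a folding compiler could emit):
putobject 5 # Computed at compile-time
setlocal x
CRuby itself generally keeps the opt_plus form for integer arithmetic, because Integer#+ can be redefined at runtime and the result therefore cannot be assumed at compile time.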
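Constant folding is easiest to see on a tree before any bytecode is emitted. The sketch below uses a hypothetical toy AST (integers are literals, symbols are variables, and `[operator, left, right]` arrays are binary operations); it illustrates the technique in general and is not how any particular Ruby compiler represents code.

```ruby
# Operators that are safe to evaluate ahead of time in this toy language.
FOLDABLE = %i[+ - *].freeze

# Returns a new tree in which every all-literal subexpression is replaced
# by its computed value. Children are folded first (bottom-up).
def fold(node)
  return node unless node.is_a?(Array)

  op, left, right = node
  left = fold(left)
  right = fold(right)

  if FOLDABLE.include?(op) && left.is_a?(Integer) && right.is_a?(Integer)
    left.public_send(op, right)
  else
    [op, left, right]
  end
end

p fold([:+, 2, 3])           # 5
p fold([:*, [:+, 2, 3], :x]) # [:*, 5, :x]
p fold([:+, :x, 1])          # [:+, :x, 1] (depends on a variable, left alone)
```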
2. Specialized Instructions
YARV includes optimized instructions for common cases:
# Instead of:
putobject 0
# Use specialized:
putobject_INT2FIX_0_ # Smaller, faster
3. Inline Caching
Method calls can be optimized with inline caches:
obj.method_name
Bytecode stores the method lookup result for faster subsequent calls.
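The idea can be sketched in plain Ruby. The `CallSite` class below is hypothetical and heavily simplified: it caches one lookup per call site, keyed on the receiver's class, and ignores singleton methods and method redefinition, both of which a real VM has to handle by invalidating the cache.

```ruby
# One object per call site in the program, remembering the last lookup.
class CallSite
  attr_reader :hits, :misses

  def initialize(method_name)
    @method_name = method_name
    @cached_class = nil
    @cached_method = nil
    @hits = 0
    @misses = 0
  end

  def call(receiver, *args)
    klass = receiver.class
    if klass.equal?(@cached_class)
      @hits += 1 # fast path: reuse the earlier lookup
    else
      @misses += 1 # slow path: full method lookup, then refill the cache
      @cached_class = klass
      @cached_method = klass.instance_method(@method_name)
    end
    @cached_method.bind(receiver).call(*args)
  end
end

site = CallSite.new(:upcase)
results = ["a", "b", :c].map { |receiver| site.call(receiver) }

p results                    # ["A", "B", :C]
p [site.hits, site.misses]   # [1, 2]
```

The second String call hits the cache; switching to a Symbol receiver misses and replaces the cached entry.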
4. Dead Code Elimination
if false
puts "never runs"
end
The compiler can eliminate the unreachable code entirely.
Disassembling Bytecode
View YARV instructions:
ruby --dump=insns -e 'code here'
Page through long output:
ruby --dump=insns -e 'code here' 2>&1 | less
Disassemble a method:
# RubyVM::InstructionSequence is built into CRuby; no require is needed.
def example
  x = 2 + 3
  x * 4
end
puts RubyVM::InstructionSequence.disasm(method(:example))
Output (abbreviated; exact text varies by Ruby version):
== disasm: #<ISeq:example@...>
local table (size: 1, argc: 0 [opts: 0, rest: -1, post: 0, block: -1, kw: -1@-1, kwrest: -1])
[ 1] x@0
0000 putobject 2
0002 putobject 3
0004 opt_plus
0006 setlocal_WC_0 x@0
0008 getlocal_WC_0 x@0
0010 putobject 4
0012 opt_mult
0014 leave
This reveals the exact bytecode YARV executes.
Bytecode vs Machine Code
| Aspect | Bytecode | Machine Code |
|---|---|---|
| Target | Virtual Machine | Physical CPU |
| Portability | Platform-independent | Platform-specific |
| Size | Compact | Larger |
| Speed | Slower (interpreted) | Fastest |
| Generation | Easier | Harder |
| Optimization | Limited | Extensive |
| Examples | YARV, JVM, Python | x86, ARM, RISC-V |
Just-In-Time (JIT) Compilation
Modern VMs compile hot bytecode paths to machine code:
Execution flow with JIT:
Bytecode → [Interpreter] → Execution (cold path)
                ↓
            [Profile]
                ↓
        Hot path detected?
                ↓
          [JIT Compile]
                ↓
Machine Code → [CPU] → Faster Execution
Benefits:
- Starts fast (interpret bytecode)
- Speeds up over time (compile hot paths)
- Adaptive optimization (optimize based on actual usage)
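Ruby’s YJIT (Yet Another Ruby JIT) compiler follows this general model: code starts out in the interpreter and is compiled to machine code once it has been called often enough.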
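The profile-then-compile flow can be imitated in a few lines. The `TieredFunction` class and its threshold are hypothetical, and the "compiled" tier is only a stand-in: it pre-translates each instruction into a lambda rather than emitting machine code. The sketch shows the control flow of tiering, not how YJIT is implemented.

```ruby
# Toy tiering: run a slow generic path until a call counter crosses a
# threshold, then swap in a specialized implementation.
class TieredFunction
  HOT_THRESHOLD = 3 # assumed value, for illustration only

  attr_reader :tier

  def initialize(program)
    @program = program
    @calls = 0
    @tier = :interpreter
    @compiled = nil
  end

  def call(arg)
    @calls += 1
    if @compiled.nil? && @calls > HOT_THRESHOLD
      @compiled = compile(@program)
      @tier = :compiled
    end
    @compiled ? @compiled.call(arg) : interpret(@program, arg)
  end

  private

  # Cold path: decode and dispatch every instruction on every call.
  def interpret(program, arg)
    stack = [arg]
    program.each do |opcode|
      case opcode
      when :dup then stack.push(stack.last)
      when :mult
        right = stack.pop
        left = stack.pop
        stack.push(left * right)
      else raise ArgumentError, "unknown opcode: #{opcode}"
      end
    end
    stack.pop
  end

  # Hot path: each instruction is translated to a lambda once, so later
  # calls skip the decoding step.
  def compile(program)
    steps = program.map do |opcode|
      case opcode
      when :dup then ->(stack) { stack.push(stack.last) }
      when :mult then ->(stack) { right = stack.pop; left = stack.pop; stack.push(left * right) }
      else raise ArgumentError, "unknown opcode: #{opcode}"
      end
    end
    ->(arg) { steps.each_with_object([arg]) { |step, stack| step.call(stack) }.pop }
  end
end

square = TieredFunction.new([:dup, :mult])
results = 5.times.map { |i| [square.call(i + 1), square.tier] }
results.each { |value, tier| puts "#{value} (#{tier})" }
```

The first three calls run in the interpreter tier; the fourth crosses the threshold and every call from then on uses the pre-translated version, with identical results.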
Bytecode Verification
Some VMs verify bytecode before execution:
Safety checks:
- Type consistency
- Stack balance (no overflow/underflow)
- Valid instruction sequences
- Proper exception handling
- Memory safety
Example (Java):
1. Load .class file (bytecode)
2. Verify bytecode is well-formed
3. Verify type safety
4. Only then execute
This prevents malicious or corrupted bytecode from crashing the VM.
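The stack-balance check is the simplest of these to sketch. The verifier below is a toy under stated assumptions: the `STACK_EFFECT` table is hand-written for a small instruction subset, `max_stack` is an arbitrary limit, and only straight-line code is handled. A real verifier, such as the JVM's, also has to follow branches and track the type of every stack slot.

```ruby
# [values popped, values pushed] per opcode; a hand-written table for a toy
# instruction subset, not YARV's actual metadata.
STACK_EFFECT = {
  putnil: [0, 1],
  putobject: [0, 1],
  dup: [1, 2],
  pop: [1, 0],
  opt_plus: [2, 1],
  opt_mult: [2, 1],
  leave: [1, 0]
}.freeze

class VerifyError < StandardError; end

# Simulates only the stack depth, never the values. Straight-line code has
# no jumps, so a single pass is enough.
def verify(program, max_stack: 16)
  depth = 0
  program.each_with_index do |(opcode, _operand), index|
    pops, pushes = STACK_EFFECT.fetch(opcode) { raise VerifyError, "unknown opcode #{opcode} at #{index}" }
    raise VerifyError, "stack underflow at #{index} (#{opcode})" if depth < pops

    depth = depth - pops + pushes
    raise VerifyError, "stack overflow at #{index} (#{opcode})" if depth > max_stack
  end
  raise VerifyError, "program must end with leave" unless program.last&.first == :leave
  raise VerifyError, "#{depth} value(s) left on the stack" unless depth.zero?

  true
end

valid = [[:putobject, 2], [:putobject, 3], [:opt_plus], [:leave]]
invalid = [[:putobject, 2], [:opt_plus], [:leave]]

puts verify(valid)
begin
  verify(invalid)
rescue VerifyError => e
  puts "rejected: #{e.message}"
end
```

Because the check runs before execution, the interpreter loop itself can skip bounds checks on every push and pop.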
Format Examples
Java .class File
Magic Number: 0xCAFEBABE
Version: 52.0 (Java 8)
Constant Pool: [...]
Access Flags: public
This Class: MyClass
Super Class: Object
Interfaces: []
Fields: [...]
Methods: [...]
Attributes: [...]
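The first two fields of this layout are fixed-width and big-endian, so they can be decoded with `String#unpack`. The sketch below builds a synthetic 8-byte header so that it runs without a real class file; the mapping from major version to Java release assumes the usual numbering, in which major version 45 is the first release and each later release adds one.

```ruby
# Synthetic 8-byte header standing in for the start of a real .class file.
header = [0xCAFEBABE, 0, 52].pack("Nnn")

# u4 magic, u2 minor_version, u2 major_version, all big-endian.
magic, minor, major = header.unpack("Nnn")

raise ArgumentError, "not a class file" unless magic == 0xCAFEBABE

# Major version 45 is the first release and each later one adds 1, so 52 maps to Java 8.
java_release = major - 44
puts format("magic=0x%08X version=%d.%d (Java %d)", magic, major, minor, java_release)
```

To inspect a real file, the header could instead be read with `File.binread("MyClass.class", 8)`, where `MyClass.class` is a placeholder path.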
Python .pyc File
Magic Number: [2-byte version tag followed by 0x0D 0x0A]
Flags: [bit field, Python 3.7+]
Timestamp: [source modification time]
Source Size: [bytes]
Code Object: [marshalled bytecode + metadata]
YARV Instruction Sequence
RubyVM::InstructionSequence format:
- Magic number
- Version
- Type (method, block, class, etc.)
- Arguments info
- Local variable table
- Bytecode instructions
- Line number mapping
- Catch table (exceptions)
Inspecting YARV Compilation
Compile to instruction sequence:
iseq = RubyVM::InstructionSequence.compile("2 + 3")
puts iseq.disasm
Compile from file:
iseq = RubyVM::InstructionSequence.compile_file("script.rb")
iseq.to_a # Serialized format
Save bytecode:
iseq = RubyVM::InstructionSequence.compile_file("script.rb")
File.binwrite("script.yarv", iseq.to_binary)
Load bytecode:
binary = File.binread("script.yarv")
iseq = RubyVM::InstructionSequence.load_from_binary(binary)
iseq.eval # Execute
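This allows shipping pre-compiled Ruby code. The binary format is tied to the Ruby version and architecture that produced it, so in practice it is used mostly for caching compiled code on the same machine rather than for distribution.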
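The save and load steps can be exercised without touching the filesystem. This is a minimal round-trip sketch, assuming CRuby and that the same Ruby process performs both the serialization and the load.

```ruby
source = "2 + 3"

iseq = RubyVM::InstructionSequence.compile(source)
binary = iseq.to_binary # String holding the serialized instruction sequence

# Loading rebuilds the instruction sequence without parsing or compiling again.
restored = RubyVM::InstructionSequence.load_from_binary(binary)
result = restored.eval

puts "#{binary.bytesize} bytes serialized"
puts result # 5
```

`load_from_binary` should only be given data produced by a trusted `to_binary` call, since the loaded instructions are executed as-is.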