Bytecode Compilation

Bytecode compilation is the process of translating high-level source code into an intermediate representation called bytecode: a compact, platform-independent instruction format designed for execution by a virtual machine.

What is Bytecode?

Bytecode sits between source code and machine code:

graph LR
    A[Source Code] -->|Compile| B[Bytecode]
    B -->|Interpret/JIT| C[Machine Code]
    C -->|Execute| D[CPU]

    style A fill:#e3f2fd
    style B fill:#fff9c4
    style C fill:#f1f8e9
    style D fill:#fce4ec

Characteristics:

More abstract than machine code
More concrete than source code
Platform-independent
Optimized for VM execution
Usually not human-readable (binary format)

Compilation Pipeline

The journey from Ruby to YARV bytecode:

┌─────────────────────┐
│   Ruby Source       │
│   x = 2 + 3         │
└──────────┬──────────┘
           ↓
    [Lexical Analysis]
           ↓
┌─────────────────────┐
│   Token Stream      │
│   [x, =, 2, +, 3]   │
└──────────┬──────────┘
           ↓
    [Syntax Analysis]
           ↓
┌─────────────────────┐
│ Abstract Syntax Tree│
│      (=)            │
│     /   \           │
│    x    (+)         │
│        /   \        │
│       2     3       │
└──────────┬──────────┘
           ↓
   [Code Generation]
           ↓
┌─────────────────────┐
│  YARV Bytecode      │
│  putobject 2        │
│  putobject 3        │
│  opt_plus           │
│  setlocal x         │
└─────────────────────┘
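
The stages above can be observed from Ruby itself. The following is a minimal sketch, assuming CRuby with the standard `ripper` library. Ripper is a parser front end exposed for tooling, so its tokens and tree illustrate the stages rather than the VM's internal data structures. The helper names (`tokens_for`, `tree_for`, `bytecode_for`) are hypothetical.

```ruby
require "ripper"

SOURCE = "x = 2 + 3"

# Lexical analysis: Ripper.lex returns entries of the form
# [[line, column], event, text, state]. Whitespace tokens are dropped here.
def tokens_for(source)
  Ripper.lex(source)
        .reject { |token| token[1] == :on_sp }
        .map { |token| token[2] }
end

# Syntax analysis: Ripper.sexp returns the parse tree as nested arrays.
def tree_for(source)
  Ripper.sexp(source)
end

# Code generation: compile to a YARV instruction sequence and disassemble it.
def bytecode_for(source)
  RubyVM::InstructionSequence.compile(source).disasm
end

p tokens_for(SOURCE)
pp tree_for(SOURCE)
puts bytecode_for(SOURCE)
```

The exact shape of the tree and the disassembly text can differ between Ruby versions.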

Example: Ruby to YARV Bytecode

Ruby source:

nil

YARV bytecode:

$ ruby --dump=insns -e 'nil'
 
== disasm: #<ISeq:<main>@-e:1 (1,0)-(1,3)> (catch: FALSE)
0000 putnil                                   ( 1)[Li]
0001 leave

More complex example:

2 + 3

YARV bytecode:

0000 putobject                                2
0002 putobject                                3
0004 opt_plus                                 <calldata!mid:+, argc:1, ARGS_SIMPLE>
0006 leave

This shows how high-level operations (+) become sequences of stack instructions.

Why Bytecode?

Advantages Over Direct Interpretation

1. Performance

Parse source code once, not every execution
Pre-optimized instruction sequences
Faster instruction dispatch
Smaller memory footprint

2. Portability

Same bytecode runs on any platform
Platform-specific VM handles execution
No recompilation needed

3. Optimization Opportunities

Constant folding at compile-time
Dead code elimination
Peephole optimization
JIT compilation can optimize hot paths

4. Security

Source code can remain private
Bytecode is harder to read than source (though it can still be disassembled)
Sandboxing and verification possible

Comparison: Interpretation vs Compilation

Pure Interpretation:
Source Code → [Interpreter] → Execution
  Fast startup, slow execution
  Example: Early JavaScript, shell scripts

Bytecode Compilation:
Source Code → [Compiler] → Bytecode → [VM] → Execution
  Moderate startup, faster execution
  Example: YARV, Python, Java

Native Compilation:
Source Code → [Compiler] → Machine Code → [CPU] → Execution
  Slow build step, fast startup, fastest execution
  Example: C, C++, Rust, Go

Bytecode Structure

Bytecode consists of instructions and operands:

Instruction format:
┌──────────────┬─────────────┐
│  Opcode      │  Operands   │
└──────────────┴─────────────┘
   (what to do)  (data to use)

Example: putobject 42
         └────┬───┘ └┬─┘
           opcode   operand

Components:

Opcode - The operation to perform (e.g., putobject, add, jump)
Operands - Parameters to the operation (e.g., values, addresses, offsets)
Metadata - Line numbers, debug info, source locations
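
The opcode/operand split can be seen in real YARV output. A small sketch, assuming CRuby: `RubyVM::InstructionSequence#to_a` returns an array whose last element is the instruction body, where each instruction appears as `[opcode, *operands]` mixed with line numbers and labels. The helper name `instructions_for` is hypothetical.

```ruby
# Return only the instruction entries of a compiled snippet.
# In the body array, instructions are Arrays; line numbers are Integers
# and labels/events are Symbols, so filtering on Array keeps instructions.
def instructions_for(source)
  body = RubyVM::InstructionSequence.compile(source).to_a.last
  body.select { |entry| entry.is_a?(Array) }
end

# Print one instruction per line: opcode first, then its operands.
instructions_for("x = 42").each do |opcode, *operands|
  puts format("%-20s %s", opcode, operands.map(&:inspect).join(", "))
end
```

The opcode names and operand encodings printed here depend on the Ruby version.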

YARV Bytecode Encoding

YARV uses a compact encoding:

# Ruby source
x = 42
 
# Bytecode (conceptual)
putobject 42    # Push 42 onto stack
setlocal x, 0   # Store in local variable x

Encoding details:

Variable-length instructions (an opcode followed by zero or more operands)
Inline operands for small values
Constant pool for complex objects
Line number mapping for debugging

Instruction Set Design

Bytecode instructions reflect the stack-based virtual machine architecture:

Stack Operations:

putnil          # Push nil
putobject <obj> # Push object
dup             # Duplicate top
pop             # Remove top
swap            # Swap top two

Arithmetic:

opt_plus        # Add top two values
opt_minus       # Subtract
opt_mult        # Multiply
opt_div         # Divide

Control Flow:

jump <offset>          # Unconditional jump
branchif <offset>      # Jump if true
branchunless <offset>  # Jump if false

Variables:

getlocal <index>     # Read local variable
setlocal <index>     # Write local variable
getinstancevariable  # Read @variable
setinstancevariable  # Write @variable

See YARV stack instructions for detailed exploration.
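
To make the stack-based model concrete, here is a toy interpreter for a handful of the instructions listed above. It is an illustration only, not YARV's implementation: the `ToyVM` class is hypothetical, locals are keyed by name, and jump targets are absolute instruction indexes for simplicity.

```ruby
# A toy stack machine that borrows YARV opcode names.
class ToyVM
  def initialize(instructions)
    @instructions = instructions
  end

  def run
    stack = []
    locals = {}
    pc = 0
    while pc < @instructions.length
      opcode, operand = @instructions[pc]
      pc += 1
      case opcode
      when :putnil    then stack.push(nil)
      when :putobject then stack.push(operand)
      when :dup       then stack.push(stack.last)
      when :pop       then stack.pop
      when :opt_plus
        right = stack.pop
        left = stack.pop
        stack.push(left + right)
      when :opt_mult
        right = stack.pop
        left = stack.pop
        stack.push(left * right)
      when :setlocal  then locals[operand] = stack.pop
      when :getlocal  then stack.push(locals.fetch(operand))
      when :jump      then pc = operand
      when :branchunless
        condition = stack.pop
        pc = operand unless condition
      when :leave
        return stack.pop
      else
        raise ArgumentError, "unknown opcode: #{opcode}"
      end
    end
    stack.pop
  end
end

# x = 2 + 3; x * 4
program = [
  [:putobject, 2],
  [:putobject, 3],
  [:opt_plus],
  [:setlocal, :x],
  [:getlocal, :x],
  [:putobject, 4],
  [:opt_mult],
  [:leave]
]
puts ToyVM.new(program).run # prints 20
```

Every operation reads its inputs from the stack and pushes its result back, which is why the instructions themselves need so few operands.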

Optimizations During Compilation

1. Constant Folding

Before:

x = 2 + 3

Naive bytecode:

putobject 2
putobject 3
opt_plus
setlocal x

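Optimized bytecode (what a constant-folding compiler could emit):

putobject 5    # Computed at compile-time
setlocal x

Note that CRuby does not fold this particular case: Integer#+ can be redefined at run time, so the compiler emits opt_plus, which takes a fast path when both operands are Integers and + has not been redefined.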
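
The folding step can be sketched as a peephole pass over a toy instruction list. This is a hypothetical illustration, not code from CRuby; it assumes instructions are represented as `[opcode, operand]` arrays and that integer addition has its usual meaning.

```ruby
# True for a putobject instruction that pushes an Integer literal.
def foldable?(insn)
  insn.is_a?(Array) && insn[0] == :putobject && insn[1].is_a?(Integer)
end

# Replace "putobject a; putobject b; opt_plus" with "putobject (a + b)".
# Working on the output list lets chained additions fold repeatedly.
def fold_constants(instructions)
  output = []
  instructions.each do |insn|
    a, b = output.last(2)
    if insn == [:opt_plus] && foldable?(a) && foldable?(b)
      output.pop(2)
      output << [:putobject, a[1] + b[1]]
    else
      output << insn
    end
  end
  output
end

naive = [[:putobject, 2], [:putobject, 3], [:opt_plus], [:setlocal, :x]]
p fold_constants(naive) # prints [[:putobject, 5], [:setlocal, :x]]
```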

2. Specialized Instructions

YARV includes optimized instructions for common cases:

# Instead of:
putobject 0
 
# Use specialized:
putobject_INT2FIX_0_  # Smaller, faster

3. Inline Caching

Method calls can be optimized with inline caches:

obj.method_name

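The call instruction carries a cache slot: the first call stores the result of the method lookup, and later calls reuse it for as long as the receiver's class and the method definition are unchanged.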
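
The idea can be modeled in plain Ruby. The sketch below is a toy monomorphic cache, assuming Ruby 2.7+ for `UnboundMethod#bind_call`. The `CallSite` class is hypothetical; CRuby's real caches live inside the VM, also handle method redefinition and singleton methods, and are not visible as Ruby objects.

```ruby
# One CallSite object stands in for one call instruction in the bytecode.
class CallSite
  attr_reader :hits, :misses

  def initialize(method_name)
    @method_name = method_name
    @cached_class = nil
    @cached_method = nil
    @hits = 0
    @misses = 0
  end

  def call(receiver, *args)
    klass = receiver.class
    if klass.equal?(@cached_class)
      @hits += 1 # fast path: reuse the previous lookup
    else
      @misses += 1 # slow path: full method lookup, then remember it
      @cached_class = klass
      @cached_method = klass.instance_method(@method_name)
    end
    @cached_method.bind_call(receiver, *args)
  end
end

site = CallSite.new(:upcase)
site.call("a") # miss: first call at this site
site.call("b") # hit: same receiver class
site.call(:c)  # miss: receiver class changed from String to Symbol
puts "hits=#{site.hits} misses=#{site.misses}"
```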

4. Dead Code Elimination

if false
  puts "never runs"
end

The compiler can eliminate the unreachable code entirely.
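
One way to check what a given interpreter does with the branch above is to look for the string literal in the disassembly. This is a small sketch; `dead_branch_compiled?` is a hypothetical helper, and the result may differ between Ruby versions and implementations.

```ruby
DEAD_BRANCH = <<~RUBY
  if false
    puts "never runs"
  end
RUBY

# True when the marker text still appears in the compiled instructions,
# meaning the branch body was not eliminated.
def dead_branch_compiled?(source, marker)
  RubyVM::InstructionSequence.compile(source).disasm.include?(marker)
end

puts dead_branch_compiled?(DEAD_BRANCH, "never runs")
```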

Disassembling Bytecode

View YARV instructions:

ruby --dump=insns -e 'code here'

Page through long output:

ruby --dump=insns -e 'code here' 2>&1 | less

Disassemble a method:

# RubyVM::InstructionSequence is built into CRuby; no require is needed
 
def example
  x = 2 + 3
  x * 4
end
 
puts RubyVM::InstructionSequence.disasm(method(:example))

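Output (abridged; exact formatting varies by Ruby version):

== disasm: #<ISeq:example@...>
local table (size: 1, argc: 0 [opts: 0, rest: -1, post: 0, block: -1, kw: -1@-1, kwrest: -1])
[ 1] x@0
0000 putobject                                2
0002 putobject                                3
0004 opt_plus                                 <calldata!mid:+, argc:1, ARGS_SIMPLE>
0006 setlocal_WC_0                            x@0
0008 getlocal_WC_0                            x@0
0010 putobject                                4
0012 opt_mult                                 <calldata!mid:*, argc:1, ARGS_SIMPLE>
0014 leave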

This reveals the exact bytecode YARV executes.

Bytecode vs Machine Code

Aspect	Bytecode	Machine Code
Target	Virtual Machine	Physical CPU
Portability	Platform-independent	Platform-specific
Size	Compact	Larger
Speed	Slower (interpreted)	Fastest
Generation	Easier	Harder
Optimization	Limited	Extensive
Examples	YARV, JVM, Python	x86, ARM, RISC-V

Just-In-Time (JIT) Compilation

Modern VMs compile hot bytecode paths to machine code:

Execution flow with JIT:

Bytecode → [Interpreter] → Execution (cold path)
            ↓
        [Profile]
            ↓
      Hot path detected?
            ↓
        [JIT Compile]
            ↓
      Machine Code → [CPU] → Faster Execution

Benefits:

Starts fast (interpret bytecode)
Speeds up over time (compile hot paths)
Adaptive optimization (optimize based on actual usage)

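Ruby’s YJIT (Yet Another Ruby JIT) compiler follows this model: methods start out in the interpreter and are compiled to machine code once they have been called often enough.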
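
The interpret-then-compile flow can be modeled with a call counter. The sketch below is a toy, not how YJIT is implemented: the `TieredFunction` class, its threshold, and its two-instruction set are all hypothetical, and "compiling" here just means building a Ruby lambda once instead of decoding instructions on every call.

```ruby
# Runs a tiny accumulator program, switching to a prebuilt lambda once hot.
class TieredFunction
  HOT_THRESHOLD = 3

  def initialize(instructions)
    @instructions = instructions
    @calls = 0
    @compiled = nil
  end

  def compiled?
    !@compiled.nil?
  end

  def call(x)
    return @compiled.call(x) if @compiled # hot path

    @calls += 1 # profile: count interpreted calls
    @compiled = compile if @calls >= HOT_THRESHOLD
    interpret(x) # cold path
  end

  private

  # Decode and dispatch every instruction on every call.
  def interpret(x)
    @instructions.reduce(x) do |acc, (op, n)|
      case op
      when :add then acc + n
      when :mul then acc * n
      else raise ArgumentError, "unknown op: #{op}"
      end
    end
  end

  # Decode once and return a lambda that only performs the arithmetic.
  def compile
    steps = @instructions.map do |op, n|
      case op
      when :add then ->(acc) { acc + n }
      when :mul then ->(acc) { acc * n }
      else raise ArgumentError, "unknown op: #{op}"
      end
    end
    ->(x) { steps.reduce(x) { |acc, step| step.call(acc) } }
  end
end

f = TieredFunction.new([[:mul, 2], [:add, 1]])
4.times { puts f.call(5) } # prints 11 four times; the fourth call uses the lambda
```

On CRuby 3.1 or later, `RubyVM::YJIT.enabled?` reports whether the real JIT is active in the current process.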

Bytecode Verification

Some VMs verify bytecode before execution:

Safety checks:

Type consistency
Stack balance (no overflow/underflow)
Valid instruction sequences
Proper exception handling
Memory safety

Example (Java):

1. Load .class file (bytecode)
2. Verify bytecode is well-formed
3. Verify type safety
4. Only then execute

This prevents malicious or corrupted bytecode from crashing the VM.
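
A stack-balance check, one of the safety checks listed above, can be sketched for a toy instruction set. This is a hypothetical illustration for straight-line code only; a real verifier such as the JVM's also tracks types and merges stack states across branches.

```ruby
# Stack effect per opcode: [values popped, values pushed].
STACK_EFFECTS = {
  putnil: [0, 1],
  putobject: [0, 1],
  dup: [1, 2],
  pop: [1, 0],
  opt_plus: [2, 1],
  opt_mult: [2, 1],
  leave: [1, 0]
}.freeze

# Returns :ok, or a String describing the first problem found.
def verify_stack(instructions)
  depth = 0
  instructions.each_with_index do |(opcode, _operand), index|
    effect = STACK_EFFECTS[opcode]
    return "unknown opcode #{opcode} at #{index}" if effect.nil?

    pops, pushes = effect
    return "stack underflow at #{index} (#{opcode})" if depth < pops

    depth += pushes - pops
  end
  depth.zero? ? :ok : "#{depth} value(s) left on the stack"
end

p verify_stack([[:putobject, 2], [:putobject, 3], [:opt_plus], [:leave]]) # prints :ok
p verify_stack([[:putobject, 2], [:opt_plus], [:leave]])
```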

Format Examples

Java .class File

Magic Number: 0xCAFEBABE
Version: 52.0 (Java 8)
Constant Pool: [...]
Access Flags: public
This Class: MyClass
Super Class: Object
Interfaces: []
Fields: [...]
Methods: [...]
Attributes: [...]
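
The first two fields are easy to read by hand. The sketch below parses the fixed 8-byte prefix of a class file (a 4-byte magic number, then 2-byte minor and major versions, all big-endian). The `ClassHeader` struct and `parse_class_header` helper are hypothetical names, and the sample bytes are synthesized rather than read from a real file.

```ruby
ClassHeader = Struct.new(:magic, :minor, :major)

# Parse u4 magic, u2 minor_version, u2 major_version from the start of a class file.
def parse_class_header(bytes)
  raise ArgumentError, "need at least 8 bytes" if bytes.bytesize < 8

  magic, minor, major = bytes.unpack("Nnn") # N = 32-bit, n = 16-bit, big-endian
  raise ArgumentError, "not a class file" unless magic == 0xCAFEBABE

  ClassHeader.new(magic, minor, major)
end

sample = [0xCAFEBABE, 0, 52].pack("Nnn")
header = parse_class_header(sample)
puts format("magic=0x%X version=%d.%d", header.magic, header.major, header.minor)
# prints magic=0xCAFEBABE version=52.0
```

For a real file, the same bytes could be obtained with `File.binread(path, 8)`.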

Python .pyc File

Magic Number: 2-byte version tag followed by 0d 0a (e.g. a7 0d 0d 0a for Python 3.11)
Flags: [bit field, Python 3.7+]
Timestamp: [source modification time]
Source Size: [bytes]
Code Object: [marshalled bytecode + metadata]

YARV Instruction Sequence

RubyVM::InstructionSequence format:
- Magic number
- Version
- Type (method, block, class, etc.)
- Arguments info
- Local variable table
- Bytecode instructions
- Line number mapping
- Catch table (exceptions)
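
Several of these fields can be inspected directly. A minimal sketch, assuming CRuby: `#to_a` exposes an array form whose first element is a format identifier, and `#to_binary` produces the serialized form. The `iseq_summary` helper is hypothetical, and the field positions follow the documented order of `#to_a`.

```ruby
# Pick a few header fields out of the array form of an instruction sequence.
def iseq_summary(iseq)
  fields = iseq.to_a
  {
    magic: fields[0],         # format identifier string
    version: "#{fields[1]}.#{fields[2]}",
    label: fields[5],
    type: fields[9],          # :top, :method, :block, ...
    locals: fields[10]        # local variable table
  }
end

# First four bytes of the binary serialization.
def binary_magic(iseq)
  iseq.to_binary.byteslice(0, 4)
end

iseq = RubyVM::InstructionSequence.compile("x = 42")
pp iseq_summary(iseq)
puts binary_magic(iseq)
```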

Inspecting YARV Compilation

Compile to instruction sequence:

iseq = RubyVM::InstructionSequence.compile("2 + 3")
puts iseq.disasm

Compile from file:

iseq = RubyVM::InstructionSequence.compile_file("script.rb")
iseq.to_a  # Serialized format

Save bytecode:

iseq = RubyVM::InstructionSequence.compile_file("script.rb")
File.binwrite("script.yarv", iseq.to_binary)

Load bytecode:

binary = File.binread("script.yarv")
iseq = RubyVM::InstructionSequence.load_from_binary(binary)
iseq.eval  # Execute

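This allows shipping pre-compiled Ruby code, though it is rarely done in practice: the binary is tied to the Ruby version and platform that produced it, and load_from_binary does not verify its input, so it should only be given binaries you generated yourself.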
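
A quick in-memory round trip confirms that the serialized form evaluates to the same result as the source. This is a small sketch assuming CRuby; the `round_trip` helper is a hypothetical name.

```ruby
# Compile, serialize, deserialize, and evaluate within the same process.
def round_trip(source)
  binary = RubyVM::InstructionSequence.compile(source).to_binary
  RubyVM::InstructionSequence.load_from_binary(binary).eval
end

puts round_trip("2 + 3") # prints 5
```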
