Bytecode compilation is the process of translating high-level source code into an intermediate representation called bytecode - a compact, platform-independent instruction format designed for execution by a virtual machine.

What is Bytecode?

Bytecode sits between source code and machine code:

Source Code → [Compile] → Bytecode → [Interpret/JIT] → Machine Code → [Execute] → CPU

Characteristics:

  • More abstract than machine code
  • More concrete than source code
  • Platform-independent
  • Optimized for VM execution
  • Usually not human-readable (binary format)

Compilation Pipeline

The journey from Ruby to YARV bytecode:

┌─────────────────────┐
│   Ruby Source       │
│   x = 2 + 3         │
└──────────┬──────────┘
           ↓
    [Lexical Analysis]
           ↓
┌─────────────────────┐
│   Token Stream      │
│   [x, =, 2, +, 3]   │
└──────────┬──────────┘
           ↓
    [Syntax Analysis]
           ↓
┌─────────────────────┐
│ Abstract Syntax Tree│
│      (=)            │
│     /   \           │
│    x    (+)         │
│        /   \        │
│       2     3       │
└──────────┬──────────┘
           ↓
   [Code Generation]
           ↓
┌─────────────────────┐
│  YARV Bytecode      │
│  putobject 2        │
│  putobject 3        │
│  opt_plus           │
│  setlocal x         │
└─────────────────────┘
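Each stage of this pipeline can be inspected from Ruby itself. The sketch below assumes CRuby 2.6 or later, where Ripper, RubyVM::AbstractSyntaxTree, and RubyVM::InstructionSequence are all available. The helper names (tokens_for, ast_for, iseq_for) are made up for illustration.

```ruby
require "ripper"

SOURCE = "x = 2 + 3"

# Stage 1: lexical analysis. Ripper.lex yields [position, type, text, state]
# entries; whitespace tokens are dropped here for readability.
def tokens_for(source)
  significant = Ripper.lex(source).reject { |_pos, type, _text, _state| type == :on_sp }
  significant.map { |_pos, _type, text, _state| text }
end

# Stage 2: syntax analysis. Returns the root node of the abstract syntax tree.
def ast_for(source)
  RubyVM::AbstractSyntaxTree.parse(source)
end

# Stage 3: code generation. Returns the compiled instruction sequence.
def iseq_for(source)
  RubyVM::InstructionSequence.compile(source)
end

p tokens_for(SOURCE)          # ["x", "=", "2", "+", "3"]
puts ast_for(SOURCE).type     # type of the root AST node
puts iseq_for(SOURCE).disasm  # human-readable YARV listing
```

The exact AST shape and disassembly text differ between Ruby versions, so treat the printed output as something to explore rather than a stable format.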

Example: Ruby to YARV Bytecode

Ruby source:

nil

YARV bytecode:

$ ruby --dump=insns -e 'nil'
 
== disasm: #<ISeq:<main>@-e:1 (1,0)-(1,3)> (catch: FALSE)
0000 putnil                                   ( 1)[Li]
0001 leave

More complex example:

2 + 3

YARV bytecode (abbreviated; call-data operands omitted, and offsets can differ between Ruby versions):

0000 putobject                                2
0002 putobject                                3
0004 opt_plus
0006 leave

This shows how high-level operations (+) become sequences of stack instructions.

Why Bytecode?

Advantages Over Direct Interpretation

1. Performance

  • Parse source code once, not every execution
  • Pre-optimized instruction sequences
  • Faster instruction dispatch
  • Smaller memory footprint

2. Portability

  • Same bytecode runs on any platform
  • Platform-specific VM handles execution
  • No recompilation needed

3. Optimization Opportunities

  • Constant folding at compile-time
  • Dead code elimination
  • Peephole optimization
  • JIT compilation can optimize hot paths

4. Security

  • Source code can remain private
  • Bytecode harder to reverse-engineer than source
  • Sandboxing and verification possible

Comparison: Interpretation vs Compilation

Pure Interpretation:
Source Code → [Interpreter] → Execution
  Fast startup, slow execution
  Example: Early JavaScript, shell scripts

Bytecode Compilation:
Source Code → [Compiler] → Bytecode → [VM] → Execution
  Moderate startup, faster execution
  Example: YARV, Python, Java

Native Compilation:
Source Code → [Compiler] → Machine Code → [CPU] → Execution
  Slow build (compiled ahead of time), fast startup, fastest execution
  Example: C, C++, Rust, Go

Bytecode Structure

Bytecode consists of instructions and operands:

Instruction format:
┌──────────────┬─────────────┐
│  Opcode      │  Operands   │
└──────────────┴─────────────┘
   (what to do)  (data to use)

Example: putobject 42
         └────┬───┘ └┬─┘
           opcode   operand

Components:

  1. Opcode - The operation to perform (e.g., putobject, add, jump)
  2. Operands - Parameters to the operation (e.g., values, addresses, offsets)
  3. Metadata - Line numbers, debug info, source locations
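The opcode/operand split can be observed on real compiled code. This is a minimal sketch, assuming CRuby's array form of an instruction sequence keeps the instruction body as its last element, with instructions stored as arrays and line numbers or labels stored as other objects. instructions_for is a hypothetical helper name.

```ruby
# Returns the instructions of a compiled snippet as [opcode, operands] pairs.
def instructions_for(source)
  body = RubyVM::InstructionSequence.compile(source).to_a.last
  # Keep only instructions (arrays); skip line numbers, events, and labels.
  body.grep(Array).map { |opcode, *operands| [opcode, operands] }
end

instructions_for("x = 42").each do |opcode, operands|
  puts format("%-16s %s", opcode, operands.inspect)
end
```

Each printed row is one instruction: the name on the left is the opcode and the list on the right holds its operands, which is empty for instructions such as leave.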

YARV Bytecode Encoding

YARV uses a compact encoding:

# Ruby source
x = 42
 
# Bytecode (conceptual)
putobject 42    # Push 42 onto stack
setlocal x, 0   # Store in local variable x

Encoding details:

  • Variable-length instructions
  • Inline operands for small values
  • Constant pool for complex objects
  • Line number mapping for debugging

Instruction Set Design

Bytecode instructions reflect the stack-based virtual machine architecture:

Stack Operations:

putnil          # Push nil
putobject <obj> # Push object
dup             # Duplicate top
pop             # Remove top
swap            # Swap top two

Arithmetic:

opt_plus        # Add top two values
opt_minus       # Subtract
opt_mult        # Multiply
opt_div         # Divide

Control Flow:

jump <offset>         # Unconditional jump
branchif <offset>     # Jump if the popped value is truthy
branchunless <offset> # Jump if the popped value is falsy (nil or false)

Variables:

getlocal <index>     # Read local variable
setlocal <index>     # Write local variable
getinstancevariable  # Read @variable
setinstancevariable  # Write @variable

See YARV stack instructions for detailed exploration.
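To make the stack discipline concrete, here is a toy interpreter for a handful of the instructions listed above. It is a sketch, not YARV: jump targets are absolute instruction indexes, locals are looked up by name rather than by index, and the run method and both sample programs are invented for illustration.

```ruby
# Executes a list of [opcode, operand] instructions on an operand stack.
def run(program)
  stack  = []
  locals = {}
  pc     = 0

  # Pops two values and pushes the result of applying the operator to them.
  binary = lambda do |operator|
    right = stack.pop
    left  = stack.pop
    stack.push(left.public_send(operator, right))
  end

  loop do
    opcode, operand = program.fetch(pc)
    pc += 1

    case opcode
    when :putnil       then stack.push(nil)
    when :putobject    then stack.push(operand)
    when :dup          then stack.push(stack.last)
    when :pop          then stack.pop
    when :swap         then stack.push(stack.pop, stack.pop)
    when :opt_plus     then binary.call(:+)
    when :opt_minus    then binary.call(:-)
    when :opt_mult     then binary.call(:*)
    when :opt_div      then binary.call(:/)
    when :getlocal     then stack.push(locals.fetch(operand))
    when :setlocal     then locals[operand] = stack.pop
    when :jump         then pc = operand
    when :branchif     then pc = operand if stack.pop
    when :branchunless then pc = operand unless stack.pop
    when :leave        then return stack.pop
    else raise ArgumentError, "unknown opcode: #{opcode}"
    end
  end
end

# x = 2 + 3; x * 4
ARITHMETIC = [
  [:putobject, 2], [:putobject, 3], [:opt_plus], [:setlocal, :x],
  [:getlocal, :x], [:putobject, 4], [:opt_mult], [:leave]
].freeze

# if false then :then_branch else :else_branch end
BRANCHING = [
  [:putobject, false],        # 0
  [:branchunless, 4],         # 1: skip to index 4 when the value is falsy
  [:putobject, :then_branch], # 2
  [:leave],                   # 3
  [:putobject, :else_branch], # 4
  [:leave]                    # 5
].freeze

p run(ARITHMETIC) # 20
p run(BRANCHING)  # :else_branch
```

Every instruction only touches the top of the stack or the program counter, which is what keeps the dispatch loop this small.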

Optimizations During Compilation

1. Constant Folding

Before:

x = 2 + 3

Naive bytecode:

putobject 2
putobject 3
opt_plus
setlocal x

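Folded bytecode (what a constant-folding compiler would emit):

putobject 5    # Computed at compile-time
setlocal x

Note that CRuby does not generally fold integer arithmetic like this, because Integer#+ can be redefined at runtime. It emits opt_plus instead, which takes a fast path as long as + has not been redefined. The 2 + 3 disassembly earlier in this section shows that form.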
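The folding step itself can be sketched over a toy AST. This is an illustration of the general technique under simplifying assumptions, not CRuby's compiler: nodes are Integer literals, Symbols naming variables, or [operator, left, right] arrays, and the fold function is hypothetical.

```ruby
# Operators this toy folder is willing to evaluate at "compile time".
FOLDABLE = %i[+ - *].freeze

# Recursively replaces operator nodes whose operands are both integer
# literals with the computed result; everything else is left in place.
def fold(node)
  return node unless node.is_a?(Array)

  operator, left, right = node
  left  = fold(left)
  right = fold(right)

  if FOLDABLE.include?(operator) && left.is_a?(Integer) && right.is_a?(Integer)
    left.public_send(operator, right)
  else
    [operator, left, right]
  end
end

p fold([:+, 2, 3])           # 5
p fold([:*, [:+, 2, 3], :y]) # [:*, 5, :y]
```

Folding bottom-up means an inner constant expression is reduced first, so the outer node can then see a literal where it previously saw a subtree.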

2. Specialized Instructions

YARV includes optimized instructions for common cases:

# Instead of:
putobject 0
 
# Use specialized:
putobject_INT2FIX_0_  # Smaller, faster
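To check which literals receive a specialized opcode on the Ruby you are running, compare the compiled opcodes directly. The sketch assumes CRuby with default compile options; opcodes_for is a hypothetical helper.

```ruby
# Returns the opcode names the compiler chose for a snippet.
def opcodes_for(source)
  body = RubyVM::InstructionSequence.compile(source).to_a.last
  body.grep(Array).map(&:first)
end

%w[0 1 2].each do |literal|
  puts "#{literal}: #{opcodes_for(literal).inspect}"
end
```

With default options, 0 and 1 should show up as putobject_INT2FIX_0_ and putobject_INT2FIX_1_, while other integers use the general putobject with an operand.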

3. Inline Caching

Method calls can be optimized with inline caches:

obj.method_name

Bytecode stores the method lookup result for faster subsequent calls.
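The idea can be sketched as a toy monomorphic cache: one slot per call site, keyed by the receiver's class. This is an illustration only. The CallSite class is hypothetical, it ignores singleton methods, and unlike a real VM it never invalidates the cache when a method is redefined.

```ruby
class CallSite
  attr_reader :hits, :misses

  def initialize(method_name)
    @method_name   = method_name
    @cached_class  = nil
    @cached_method = nil
    @hits          = 0
    @misses        = 0
  end

  # Dispatches the call, reusing the cached lookup when the receiver's
  # class matches the one seen on the previous call.
  def call(receiver, *args)
    if receiver.class.equal?(@cached_class)
      @hits += 1
    else
      @misses += 1
      @cached_class  = receiver.class
      @cached_method = receiver.class.instance_method(@method_name) # full lookup
    end
    @cached_method.bind(receiver).call(*args)
  end
end

site = CallSite.new(:upcase)
p site.call("abc") # "ABC" (miss: first lookup for String)
p site.call("xyz") # "XYZ" (hit: same receiver class)
p site.call(:sym)  # :SYM  (miss: receiver class changed to Symbol)
puts "hits=#{site.hits} misses=#{site.misses}"
```

A call site that keeps seeing the same receiver class pays for the lookup once, which is why this pays off for the common case of type-stable code.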

4. Dead Code Elimination

if false
  puts "never runs"
end

The compiler can eliminate the unreachable code entirely.
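Whether a given Ruby actually drops the branch can be checked by searching the disassembly for the string literal. This is a sketch; compiled_code_mentions? is a hypothetical helper, and the result for the dead branch depends on the Ruby version and compiler in use, so no output is asserted for it here.

```ruby
# Reports whether a string literal survives into the compiled bytecode.
def compiled_code_mentions?(source, literal)
  RubyVM::InstructionSequence.compile(source).disasm.include?(literal)
end

live = 'puts "always runs"'
dead = <<~RUBY
  if false
    puts "never runs"
  end
RUBY

puts "live literal present: #{compiled_code_mentions?(live, 'always runs')}"
puts "dead literal present: #{compiled_code_mentions?(dead, 'never runs')}"
```

If the second line prints false, the unreachable branch never made it into the instruction sequence.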

Disassembling Bytecode

View YARV instructions:

ruby --dump=insns -e 'code here'

Page through long output:

ruby --dump=insns -e 'code here' 2>&1 | less

Disassemble a method:

# RubyVM::InstructionSequence is built into CRuby; no require is needed
 
def example
  x = 2 + 3
  x * 4
end
 
puts RubyVM::InstructionSequence.disasm(method(:example))

Output (abbreviated; exact operands and offsets vary by Ruby version):

== disasm: #<ISeq:example@...>
local table (size: 1, argc: 0 [opts: 0, rest: -1, post: 0, block: -1, kw: -1@-1, kwrest: -1])
[ 1] x@0
0000 putobject                                2
0002 putobject                                3
0004 opt_plus
0006 setlocal_WC_0                            x@0
0008 getlocal_WC_0                            x@0
0010 putobject                                4
0012 opt_mult
0014 leave

This reveals the exact bytecode YARV executes.

Bytecode vs Machine Code

Aspect         Bytecode               Machine Code
Target         Virtual Machine        Physical CPU
Portability    Platform-independent   Platform-specific
Size           Compact                Larger
Speed          Slower (interpreted)   Fastest
Generation     Easier                 Harder
Optimization   Limited                Extensive
Examples       YARV, JVM, Python      x86, ARM, RISC-V

Just-In-Time (JIT) Compilation

Modern VMs compile hot bytecode paths to machine code:

Execution flow with JIT:

Bytecode → [Interpreter] → Execution (cold path)
            ↓
        [Profile]
            ↓
      Hot path detected?
            ↓
        [JIT Compile]
            ↓
      Machine Code → [CPU] → Faster Execution

Benefits:

  • Starts fast (interpret bytecode)
  • Speeds up over time (compile hot paths)
  • Adaptive optimization (optimize based on actual usage)

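Ruby’s YJIT (Yet Another Ruby JIT) compiler follows this general model: code starts out in the interpreter and is compiled to machine code once it has been called often enough.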
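The tiering decision in the flow above can be sketched in a few lines. This is a toy model, not how any real JIT is built: TieredFunction is a hypothetical class, and the "compiled" tier is simply a second lambda standing in for generated machine code.

```ruby
class TieredFunction
  attr_reader :tier, :calls

  def initialize(interpreted:, compiled:, threshold: 3)
    @interpreted = interpreted
    @compiled    = compiled
    @threshold   = threshold
    @calls       = 0
    @tier        = :interpreted
  end

  # Counts calls (the profile) and switches tiers once the function is hot.
  def call(*args)
    @calls += 1
    @tier = :compiled if @calls > @threshold
    implementation = @tier == :compiled ? @compiled : @interpreted
    implementation.call(*args)
  end
end

square = TieredFunction.new(
  interpreted: ->(n) { Array.new(n) { n }.sum }, # deliberately roundabout
  compiled:    ->(n) { n * n },
  threshold:   3
)

results = 5.times.map { square.call(4) }
p results     # [16, 16, 16, 16, 16]
p square.tier # :compiled

# On CRuby builds that include YJIT, this reports whether it is active.
yjit_on = defined?(RubyVM::YJIT) && RubyVM::YJIT.respond_to?(:enabled?) && RubyVM::YJIT.enabled?
puts "YJIT enabled: #{yjit_on ? true : false}"
```

Both tiers must return the same results; only the cost of producing them changes once the threshold is crossed.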

Bytecode Verification

Some VMs verify bytecode before execution:

Safety checks:

  • Type consistency
  • Stack balance (no overflow/underflow)
  • Valid instruction sequences
  • Proper exception handling
  • Memory safety

Example (Java):

1. Load .class file (bytecode)
2. Verify bytecode is well-formed
3. Verify type safety
4. Only then execute

This prevents malicious or corrupted bytecode from crashing the VM or bypassing its type-safety guarantees.
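The stack-balance check is the easiest of these to illustrate. Below is a minimal sketch for a toy, branch-free instruction set. The STACK_EFFECT table and verify function are assumptions made for this example and do not correspond to any real VM's verifier.

```ruby
# [values popped, values pushed] for each opcode in the toy instruction set.
STACK_EFFECT = {
  putnil:    [0, 1],
  putobject: [0, 1],
  dup:       [1, 2],
  pop:       [1, 0],
  swap:      [2, 2],
  opt_plus:  [2, 1],
  opt_mult:  [2, 1],
  leave:     [1, 0]
}.freeze

# Returns nil when the program is well-formed, or a message describing
# the first problem found. Nothing is executed; only depths are tracked.
def verify(program)
  depth = 0
  program.each_with_index do |(opcode, _operand), index|
    pops, pushes = STACK_EFFECT.fetch(opcode) do
      return "unknown opcode #{opcode} at #{index}"
    end
    return "stack underflow at #{index} (#{opcode})" if depth < pops

    depth += pushes - pops
  end
  return "program must end with leave" unless program.last&.first == :leave
  return "#{depth} value(s) left on the stack" unless depth.zero?

  nil
end

p verify([[:putobject, 2], [:putobject, 3], [:opt_plus], [:leave]]) # nil
p verify([[:opt_plus], [:leave]])                                   # "stack underflow at 0 (opt_plus)"
```

Because the check walks the instructions without running them, a malformed program is rejected before it can pop from an empty stack at run time.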

Format Examples

Java .class File

Magic Number: 0xCAFEBABE
Version: 52.0 (Java 8)
Constant Pool: [...]
Access Flags: public
This Class: MyClass
Super Class: Object
Interfaces: []
Fields: [...]
Methods: [...]
Attributes: [...]

Python .pyc File

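Magic Number: [2-byte version tag followed by 0x0d0a; differs per Python release]
Flags: [bit field, Python 3.7+]
Timestamp: [source modification time, or a source hash for hash-based .pyc files]
Source Size: [bytes]
Code Object: [marshalled bytecode + metadata]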

YARV Instruction Sequence

RubyVM::InstructionSequence format:
- Magic number
- Version
- Type (method, block, class, etc.)
- Arguments info
- Local variable table
- Bytecode instructions
- Line number mapping
- Catch table (exceptions)
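The start of this format can be inspected in memory without writing a file. The sketch assumes CRuby, where the serialized form begins with the four-byte magic YARB. yarv_magic and round_trip are hypothetical helper names.

```ruby
# Returns the first four bytes of the serialized instruction sequence.
def yarv_magic(source)
  RubyVM::InstructionSequence.compile(source).to_binary.byteslice(0, 4)
end

# Serializes, reloads, and evaluates a snippet entirely in memory.
def round_trip(source)
  binary = RubyVM::InstructionSequence.compile(source).to_binary
  RubyVM::InstructionSequence.load_from_binary(binary).eval
end

puts yarv_magic("2 + 3") # YARB
puts round_trip("2 + 3") # 5
```

The remaining header fields and the layout after them are internal to CRuby and change between versions, so this sketch deliberately reads no further than the magic number.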

Inspecting YARV Compilation

Compile to instruction sequence:

iseq = RubyVM::InstructionSequence.compile("2 + 3")
puts iseq.disasm

Compile from file:

iseq = RubyVM::InstructionSequence.compile_file("script.rb")
iseq.to_a  # Array representation of the instruction sequence

Save bytecode:

iseq = RubyVM::InstructionSequence.compile_file("script.rb")
File.binwrite("script.yarv", iseq.to_binary)

Load bytecode:

binary = File.binread("script.yarv")
iseq = RubyVM::InstructionSequence.load_from_binary(binary)
iseq.eval  # Execute

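This allows shipping pre-compiled Ruby code, though it is rarely done in practice. The binary format is tied to the Ruby version and architecture that produced it, and load_from_binary performs no verification, so only load binaries you generated yourself.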