Bytecode Interpreters

These are two different bees

🐝 is "bumble"
 
buzz (🐝) say 🐝
 
buzz ("honey")

You likely knew this immediately. But how did the computer?

How many 🐝s do we really need?

Potato, the fake language above, was originally a tree-walker, meaning it traversed a tree representation of its syntax and executed as it went. In every step from source code to interpretation it saw three 🐝 nodes

Now it has become a more complex and performant bytecode interpreter, which requires far fewer 🐝s

Source → Tokens → Parser (AST) → Desugaring → Semantic Analysis → IR → Bytecode → VM → Output

Tokenization

Tokenization breaks code into chunks called lexemes: short runs of raw characters from the file. Each lexeme is then classified into a token and given a type so the parser knows what to do with it2

+ is a lexeme, which becomes a token when we decide what it represents - typically, the addition operator

Many languages break tokens into types including: keywords (if, else), literals (strings, numbers), identifiers (variables), operators (+, %), and separators (:, ;)2

Potato recognizes:

  • one keyword, say (log a value)
  • operators: potato (addition), equals?, is (assignment), and gains (add assignment)
  • literals: numbers, strings (""), and booleans (:)|:()
  • separators: () and ,
🍠 main.potato
 
say :) equals? :(

The above example can be broken down into tokens:

[:PRINT, :BOOLEAN(:)), :EQUALS_EQUALS, :BOOLEAN(:()]
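The token stream above can be produced by a small scanner. Here is a minimal, hypothetical sketch in Ruby; the token names mirror the list above, but this is not Potato's actual lexer:

```ruby
# Hypothetical tokenizer sketch: scan the source left to right,
# match a lexeme at the front, and classify it into a token.
def tokenize(source)
  tokens = []
  rest = source
  until rest.empty?
    case rest
    when /\A\s+/       then nil                        # skip whitespace
    when /\Asay\b/     then tokens << [:PRINT]
    when /\Aequals\?/  then tokens << [:EQUALS_EQUALS]
    when /\A(:\)|:\()/ then tokens << [:BOOLEAN, $1]
    else raise "Unexpected character: #{rest[0]}"
    end
    rest = $' # continue after the matched lexeme
  end
  tokens
end

tokenize("say :) equals? :(")
# => [[:PRINT], [:BOOLEAN, ":)"], [:EQUALS_EQUALS], [:BOOLEAN, ":("]]
```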

Parsing

Now that we've broken down our code, we need to piece it back together and give it structure

Like in human languages, to parse means to take input and give it meaning3

The parsing step will transform Potato's tokens into a tree representation, referred to as an Abstract Syntax Tree (AST)4, which will help the interpreter understand the relationship between tokens

If we printed the AST, it might look like:

PRINT
└── EQUALS_EQUALS
    ├── BOOLEAN true
    └── BOOLEAN false
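One common way to build such a tree is recursive descent, where each grammar rule becomes a method that consumes tokens and returns a node. A hypothetical sketch follows; the Node struct and token shapes are illustrative, not Potato's real classes:

```ruby
# Hypothetical recursive descent sketch: each method handles one
# grammar rule and returns an AST node.
Node = Struct.new(:type, :value, :children)

def parse(tokens)
  @tokens = tokens
  statement
end

def statement
  if @tokens.first == [:PRINT]
    @tokens.shift
    Node.new(:print, nil, [expression]) # say wraps the expression after it
  else
    expression
  end
end

def expression
  left = primary
  while @tokens.first == [:EQUALS_EQUALS]
    @tokens.shift
    left = Node.new(:equals_equals, nil, [left, primary])
  end
  left
end

def primary
  _type, value = @tokens.shift
  Node.new(:boolean, value == ":)", [])
end

ast = parse([[:PRINT], [:BOOLEAN, ":)"], [:EQUALS_EQUALS], [:BOOLEAN, ":("]])
ast.type             # => :print
ast.children[0].type # => :equals_equals
```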

Desugaring

As languages grow, they tend to introduce shorthand to make writing expressions quicker and easier6

For example, in Potato if we want to increment a value we need to first sum it with 1 and then assign it back to itself:

num is num potato 1

Which is kind of ugly. The answer could be to add a gains operator able to handle both addition and assignment:

num gains 1

Once we do so, we will have to update each step of the language pipeline to understand what to do when it sees this new node. As a language matures, this could result in a massive grammar

However, gains isn't doing anything that Potato doesn't already know how to do. We can avoid maintaining separate logic by replacing instances of ADD_ASSIGN AST nodes with ADD and ASSIGN nodes7

ADD_ASSIGN              ASSIGN
├── VARIABLE num        ├── VARIABLE num
└── NUMBER 1            └── ADD
                            ├── VARIABLE num
                            └── NUMBER 1

This is the difference between core language (syntax the interpreter understands) and syntactic sugar (shorthand)7. A desugaring step occurs after parsing to translate sugared nodes into their core form before any later steps have to deal with them

def self.desugar_node(node)
  case node.type
  when :add_assign
    var = node.children[0]
    value = node.children[1]
 
    AST::Node.new(:assign, nil, [
      var,
      AST::Node.new(:add, nil, [var, value])
    ], node.line)
  else
    # ...
  end
end
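To see the rewrite end to end, here is a self-contained sketch assuming a minimal AST::Node struct shaped like the one implied above (the struct definition is an assumption; Potato's real class may differ):

```ruby
# Assumed minimal node shape: (type, value, children, line)
module AST
  Node = Struct.new(:type, :value, :children, :line)
end

def desugar_node(node)
  return node unless node.type == :add_assign
  var, value = node.children
  # num gains 1  =>  num is num potato 1
  AST::Node.new(:assign, nil, [
    var,
    AST::Node.new(:add, nil, [var, value], node.line)
  ], node.line)
end

sugared = AST::Node.new(:add_assign, nil, [
  AST::Node.new(:variable, "num", [], 1),
  AST::Node.new(:number, 1, [], 1)
], 1)

core = desugar_node(sugared)
core.type             # => :assign
core.children[1].type # => :add
```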

Semantic Analysis

The source code has been chopped up, given meaning, and organized into a tree structure

At the beginning I presented a snippet of Potato code and hypothesized that someone would be able to pick out the number of unique 🐝 values at a glance. Let's try that again, but from Potato's perspective:

ASSIGN
├── VARIABLE 🐝
└── STRING bumble
FUNCTION buzz
├── PARAMS
│   └── PARAM 🐝
└── BODY
    └── PRINT
        └── VARIABLE 🐝
FUNC_CALL buzz
└── STRING honey

Although we can see the relationship between nodes, we still don't have any information about the relationship between values

This is the point where we need to consider some language design decisions, for instance:

  • can a variable's value change after assignment? (mutability)
  • should we be able to use a variable that has not yet been assigned? (declaration order)
  • do functions have access to outside variables or do they need to be passed in? (scoping)
  • are variables declared in different places allowed to share a name? (shadowing)

Potato’s answers are: yes, no, yes, and yes

Collectively these concepts inform our language's scoping and name resolution. The semantic analysis step walks the AST and resolves which symbols refer to the same value, tracking where each variable is defined, whether it is local or captured from an outer scope, and raising errors for anything used before it exists

By the end, we'll go from three 🐝s to two:

global
├── 🐝 0 # variable
└── buzz
    └── @🐝 0 # parameter
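One way to get there is a chain of scope objects, where each scope hands out local slot indices and falls back to its parent on lookup. A hypothetical sketch, not Potato's real resolver:

```ruby
# Each scope assigns its own slot indices and defers unknown names
# to the enclosing scope; an unknown name at the top is an error.
class Scope
  def initialize(parent = nil)
    @parent = parent
    @symbols = {}
  end

  def define(name)
    @symbols[name] = @symbols.size # next free local slot
  end

  def lookup(name)
    @symbols.fetch(name) do
      @parent ? @parent.lookup(name) : raise("undefined: #{name}")
    end
  end
end

global = Scope.new
global.define("🐝")      # the global 🐝: slot 0

buzz = Scope.new(global)
buzz.define("🐝")        # the parameter 🐝 shadows it: also slot 0

buzz.lookup("🐝") # resolves to buzz's own slot 0, never reaching global
```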

The semantic analysis step is also responsible for type checking (verifying that operations are performed on compatible types) and generally ensuring that the AST is correct enough to use in later steps8

Lowering

Tree-walking interpreters only require an AST to execute a program. This is straightforward, but slow. We need to build the AST, traverse it, and interpret it every time we run our code4

Bytecode interpreters avoid this by compiling a set of instructions the interpreter can use directly. This means we only need to do the upfront work (tokenization through to compilation) once4

I've said the word bytecode a lot; it just means instructions (basic actions) encoded into binary, often represented as hexadecimal4

These instructions, called Intermediate Representation (IR), express the most concise sequence of actions necessary to execute a program. A language pipeline may include multiple lowering steps wherein the AST is simplified and flattened into linear IR; we'll do this in one pass1

In order to execute our 🐝 example we'll need: Push, LoadVar, LoadCaptured, StoreVar, Print, Call, Return, and Jump instructions. One last time, we traverse the AST and produce a to-do list of instructions for the VM

case node.type
when :function
  func_ir(node)
 
when :string
  write IR::Push.new(node.value)
 
when :print
  node.children.each { |child| ir(child) }
  write IR::Print.new
 
when :variable
  symbol = @cur_scope.lookup(node.value)
 
  if symbol.kind == :captured
    write IR::LoadCaptured.new(symbol.locals_index)
  else
    write IR::LoadVar.new(symbol.locals_index)
  end
 
when :assign
  ir(node.children[1])
  var_name = node.children[0].value
  index = @cur_scope.lookup(var_name)&.locals_index
  write IR::StoreVar.new(index)
 
when :func_call
  node.children.each { |child| ir(child) }
  write IR::Call.new(@cur_scope.lookup(node.value).bytecode_location, node.children.size)
end

Variables no longer have names, just indices that tell the VM where to find their values. StoreVar 0 writes a value to slot 0, so any later LoadVar 0 will be able to find something there. The VM doesn't need to know anything about the program; it just follows the instructions step by step

We've gone from two 🐝s to none

<struct Potato::IR::Push value="bumble"> # push to stack
<struct Potato::IR::StoreVar index=0> # grab from stack and store in locals[0]
<struct Potato::IR::Jump target=28> # ask VM to skip past function body
<struct Potato::IR::LoadVar index=0> # grab value from locals[0] and push to stack
<struct Potato::IR::Print> # print stack value
<struct Potato::IR::Return> # exit function scope
<struct Potato::IR::Push value="honey"> # push to stack
<struct Potato::IR::Call target=21, arg_count=1> # enter function scope, execute

Compiling

The compiler encodes our instruction list into bytecode: an opcode (a byte representing the instruction type) and an operand (an index or value), written into a binary file to be passed to the VM4

def self.write(f, opcode, value = nil)
  f.write([opcode].pack("C"))
  f.write([value].pack("L>")) if value
end
 
# ...
 
case instruction
when IR::Push
  case instruction.value
  when String
    write(f, 0x06, instruction.value.bytesize)
    f.write(instruction.value)
  end
when IR::LoadVar
  write(f, 0x04, instruction.index)
when IR::LoadCaptured
  write(f, 0x0C, instruction.index)
when IR::StoreVar
  write(f, 0x05, instruction.index)
when IR::Print
  write(f, 0x03)
when IR::Call
  write(f, 0x09, instruction.target)
  f.write([instruction.arg_count].pack("L>"))
when IR::Return
  write(f, 0x0A)
when IR::Jump
  write(f, 0x0B, instruction.target)
end

For example, <struct Potato::IR::LoadVar index=0> becomes opcode 0x04 followed by a 4-byte operand for the index, 5 bytes total

Hexadecimal uses the digits 0-9 and letters a-f. It's an easier way to write binary; for example, 0x04 in hex is 00000100 in binary10
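We can verify the 5-byte figure in irb using the same pack format strings as the compiler snippet above:

```ruby
require "stringio"

f = StringIO.new
f.write([0x04].pack("C")) # 1-byte opcode for LoadVar
f.write([0].pack("L>"))   # 4-byte big-endian operand (index 0)

f.string.bytesize     # => 5
f.string.unpack("H*") # => ["0400000000"]
```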

Interpreting (VM)

A bytecode VM has a straightforward task: read each byte, and follow the instruction it represents4

0600 0000 0662 756d 626c 6505 0000 0000
0b00 0000 1c04 0000 0000 030a 0600 0000
0568 6f6e 6579 0900 0000 1500 0000 01

Potato's VM is stack-based, which means operations work by pushing and popping values off the stack. For example, addition works by asking the VM to grab the top two values on the stack, add them, and push the result back4
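Addition doesn't appear in the 🐝 example, but an ADD instruction would be just a pop, pop, push; here is a sketch of the idea using a plain Ruby array as the stack:

```ruby
stack = []
stack.push(1) # Push 1
stack.push(2) # Push 2

# an ADD instruction pops the top two values...
b = stack.pop
a = stack.pop
# ...and pushes their sum back for the next instruction to use
stack.push(a + b)

stack # => [3]
```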

case byte
when 0x03 # print
  puts stack.pop
when 0x04 # load variable
  index = read
  stack.push(locals[index])
when 0x0C # load captured variable
  index = read
  parent_locals = scopes.last[:locals]
  stack.push(parent_locals[index])
when 0x05 # store var
  index = read
  locals[index] = stack.pop
when 0x06 # string
  length = read
  value = @bytecode[@pos, length].pack("C*")
  @pos += length
  stack.push(value)
when 0x09 # call
  target = read
  arg_count = read
  scopes.push({ locals: locals, call_site: @pos })
  locals = Array.new(size_of_cur_frame)
  args = stack.pop(arg_count)
  args.each_with_index { |arg, i| locals[i] = arg }
  @pos = target
when 0x0A # return
  scope = scopes.pop
  locals = scope[:locals]
  @pos = scope[:call_site]
when 0x0B # jump
  @pos = read
end

Stepping through the 🐝 example:

0x06 Push value="bumble"
# stack: ["bumble"]  locals: []
 
0x05 StoreVar index=0
# stack: []  locals: ["bumble"]
 
0x0B Jump target=28
 
0x06 Push value="honey"
# stack: ["honey"]  locals: ["bumble"]
 
0x09 Call target=21, arg_count=1
# new frame, the arg is moved to locals
# stack: []  locals: ["honey"]
 
0x04 LoadVar index=0
# stack: ["honey"]  locals: ["honey"]
 
0x03 Print
# > "honey"
 
0x0A Return
# move back to parent frame
# stack: []  locals: ["bumble"]

There it is! 🎉 After quite a bit of chopping up, rearranging, and translating, honey is printed to the terminal

TL;DR

As code authors we might want to write a ton of 🐝s, and that's okay! We can have as many 🐝s as we want because the language pipeline will remove everything it doesn't need as soon as it doesn't need it

Source → Tokens → AST → Desugaring → Semantic Analysis → IR → Bytecode → VM → Output
  • Semantic Analysis: Who are these things and where do they live?
  • IR: What do we need to do, in what order?
  • Bytecode: How can we say all that... but smaller?
  • VM: Read tiny instructions, execute, repeat

Resources