Bytecode Interpreters
These are two different bees

🐝 is "bumble"
buzz (🐝) say 🐝
buzz ("honey")

You likely knew this immediately. But how did the computer?
How many 🐝s do we really need?
Potato, the fake language above, was originally a tree-walker, meaning it traversed a tree representation of its syntax and executed as it went. In every step from source code to interpretation it saw three 🐝 nodes.
Now it has become a more complex and performant bytecode interpreter, which requires far fewer 🐝s.
Source → Tokens → Parser (AST) → Desugaring → Semantic Analysis → IR → Bytecode → VM → Output

Tokenization
Tokenization breaks code into chunks called lexemes, essentially runs of raw characters from the file. Each lexeme is then classified into a token and given a type so the parser knows what to do with it [2].
+ is a lexeme, which becomes a token when we decide what it represents: typically, the addition operator.
Many languages break tokens into types including: keywords (if, else), literals (strings, numbers), identifiers (variables), operators (+, %), and separators (:, ;) [2].
Potato recognizes:
- one keyword: say (log a value)
- operators: potato (addition), equals? (equality), is (assignment), and gains (add assignment)
- literals: numbers, strings (""), and the booleans :) and :(
- separators: ( ) and ,
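To make those categories concrete, here is a toy scanner in Ruby; the regex and names are my own sketch, not Potato's actual tokenizer:

```ruby
# Hypothetical scanner for Potato's token categories (illustrative only).
KEYWORDS  = %w[say]
OPERATORS = %w[potato equals? is gains]

def tokenize(source)
  # Grab strings, smiley booleans, separators, or any other run of characters.
  source.scan(/"[^"]*"|:\)|:\(|[(),]|\S+/).map do |lexeme|
    type =
      if KEYWORDS.include?(lexeme)           then :keyword
      elsif OPERATORS.include?(lexeme)       then :operator
      elsif lexeme =~ /\A\d+\z/              then :number
      elsif lexeme.start_with?('"')          then :string
      elsif [":)", ":("].include?(lexeme)    then :boolean
      elsif ["(", ")", ","].include?(lexeme) then :separator
      else                                        :identifier
      end
    [type, lexeme]
  end
end

tokenize('say :) equals? :(')
# => [[:keyword, "say"], [:boolean, ":)"], [:operator, "equals?"], [:boolean, ":("]]
```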
main.potato

say :) equals? :(

The above example can be broken down into tokens:

[:PRINT, :BOOLEAN(:)), :EQUALS_EQUALS, :BOOLEAN(:()]

Parsing
Now that we've broken down our code, we need to piece it back together and give it structure.
Like in human languages, to parse means to take input and give it meaning [3].
The parsing step will transform Potato's tokens into a tree representation, referred to as an Abstract Syntax Tree (AST) [4], that will help the interpreter understand the relationships between tokens.
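As a sketch, such a tree can be built from a simple node struct. The shape below (type, value, children, line) is an assumption, chosen to line up with the desugaring snippet later in the post:

```ruby
# Hypothetical AST node: a type tag, an optional value, child nodes, and a
# source line number for error reporting.
module AST
  Node = Struct.new(:type, :value, :children, :line)
end

# say :) equals? :( would parse into:
tree = AST::Node.new(:print, nil, [
  AST::Node.new(:equals_equals, nil, [
    AST::Node.new(:boolean, true,  [], 1),
    AST::Node.new(:boolean, false, [], 1)
  ], 1)
], 1)

tree.children[0].children.map(&:value) # => [true, false]
```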
If we printed the AST, it might look like:
PRINT
└── EQUALS_EQUALS
    ├── BOOLEAN true
    └── BOOLEAN false

Desugaring
As languages grow they tend to introduce shorthand to make writing expressions quicker and easier [6].
For example, in Potato if we want to increment a value we need to first sum it with 1 and then assign it back to itself:

num is num potato 1

Which is kind of ugly. The answer could be to add a gains operator able to handle both addition and assignment:

num gains 1

Once we do so, we will have to update each step of the language pipeline to understand what to do when it sees this new node. As a language matures this could result in a massive grammar.
However, gains isn't doing anything that potato doesn't already know how to do. We can avoid maintaining separate logic by replacing instances of ADD_ASSIGN AST nodes with ADD and ASSIGN nodes [7].
ADD_ASSIGN              ASSIGN
├── VARIABLE num        ├── VARIABLE num
└── NUMBER 1            └── ADD
                            ├── VARIABLE num
                            └── NUMBER 1

This is the difference between core language (syntax the interpreter understands) and syntactic sugar (shorthand) [7]. A desugaring step occurs after parsing to translate sugared nodes into their core form before any later steps have to deal with them.
def self.desugar_node(node)
case node.type
when :add_assign
var = node.children[0]
value = node.children[1]
AST::Node.new(:assign, nil, [
var,
AST::Node.new(:add, nil, [var, value])
], node.line)
else
# ...
end
end

Semantic Analysis
The source code has been chopped up, given meaning, and organized into a tree structure.
At the beginning I presented a snippet of Potato code and hypothesized that someone would be able to pick out the number of unique 🐝 values at a glance. Let's try that again, but from Potato's perspective:
ASSIGN
├── VARIABLE 🐝
└── STRING bumble

FUNCTION buzz
├── PARAMS
│   └── PARAM 🐝
└── BODY
    └── PRINT
        └── VARIABLE 🐝

FUNC_CALL buzz
└── STRING honey

Although we can see the relationship between nodes we still don't have any information about the relationship between values.
This is the point where we need to consider some language design decisions, for instance:
- can a variableβs value change after assignment? (mutability)
- should we be able to use a variable that has not yet been assigned? (declaration order)
- do functions have access to outside variables or do they need to be passed in? (scoping)
- are variables declared in different places allowed to share a name? (shadowing)
Potato's answers are: yes, no, yes, and yes.
Collectively these concepts inform our language's scoping and name resolution. The semantic analysis step walks the AST and resolves which symbols refer to the same value, tracking where each variable is defined, whether it is local or captured from an outer scope, and raising errors for anything used before it exists.
By the end, we'll go from three 🐝s to two:
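A minimal sketch of how such scope tracking might work, assuming a simple chain of hash-backed scopes (the class and method names here are illustrative, not Potato's):

```ruby
# Hypothetical scope chain: each scope maps names to symbol info and falls
# back to its parent on lookup.
class Scope
  def initialize(parent = nil)
    @parent  = parent
    @symbols = {}
  end

  # Record a name in this scope, assigning it the next locals slot.
  def define(name, kind)
    @symbols[name] = { kind: kind, locals_index: @symbols.size }
  end

  # Resolve a name here, or walk outward through enclosing scopes.
  def lookup(name)
    @symbols[name] || @parent&.lookup(name)
  end
end

global = Scope.new
global.define("🐝", :variable)       # 🐝 is "bumble"
buzz = Scope.new(global)
buzz.define("🐝", :parameter)        # buzz (🐝) ...
buzz.lookup("🐝")[:kind]             # => :parameter (shadows the global 🐝)
```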
global
├── 🐝 0 # variable
└── buzz
    └── @🐝 0 # parameter

The semantic analysis step is also responsible for type checking (verifying that operations are performed on compatible types) and generally ensuring that the AST is correct enough to use in later steps [8].
Lowering
Tree-walking interpreters only require an AST to execute a program. This is straightforward, but slow. We need to build the AST, traverse it, and interpret it every time we run our code [4].
Bytecode interpreters avoid this by compiling a set of instructions the interpreter can use directly. This means we only need to do the upfront work (tokenization through to compilation) once [4].
I've said the word bytecode a lot; it just means instructions (basic actions) encoded into binary, often represented as hexadecimal [4].
These instructions, called Intermediate Representation (IR), express the most concise sequence of actions necessary to execute a program. A language pipeline may include multiple lowering steps wherein the AST is simplified and flattened into linear IR; we'll do this in one pass [1].
In order to execute our 🐝 example we'll need: Push, LoadVar, LoadCaptured, StoreVar, Print, Call, Return, and Jump instructions. One last time we traverse the AST and produce a to-do list of instructions for the VM.
case node.type
when :function
func_ir(node)
when :string
write IR::Push.new(node.value)
when :print
node.children.each { |child| ir(child) }
write IR::Print.new
when :variable
symbol = @cur_scope.lookup(node.value)
if symbol.kind == :captured
write IR::LoadCaptured.new(symbol.locals_index)
else
write IR::LoadVar.new(symbol.locals_index)
end
when :assign
ir(node.children[1])
var_name = node.children[0].value
index = @cur_scope.lookup(var_name)&.locals_index
write IR::StoreVar.new(index)
when :func_call
node.children.each { |child| ir(child) }
write IR::Call.new(@cur_scope.lookup(node.value).bytecode_location, node.children.size)
end

Variables no longer have names, just indices that tell the VM where to find their values. StoreVar 0 writes a value to slot 0, so any later LoadVar 0 will be able to find something there. The VM doesn't need to know anything about the program; it just follows the instructions step by step.
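The slot mechanics can be sketched in a few lines, assuming a plain array for locals:

```ruby
# How StoreVar/LoadVar shuttle a value between the stack and a locals slot.
stack  = ["bumble"]     # left on the stack by a Push instruction
locals = []

locals[0] = stack.pop   # StoreVar 0: move top of stack into slot 0
stack.push(locals[0])   # LoadVar 0: copy slot 0 back onto the stack

stack   # => ["bumble"]
locals  # => ["bumble"]
```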
We've gone from two 🐝s to none:
<struct Potato::IR::Push value="bumble"> # push to stack
<struct Potato::IR::StoreVar index=0> # grab from stack and store in locals[0]
<struct Potato::IR::Jump target=28> # ask VM to skip past function body
<struct Potato::IR::LoadVar index=0> # grab value from locals[0] and push to stack
<struct Potato::IR::Print> # print stack value
<struct Potato::IR::Return> # exit function scope
<struct Potato::IR::Push value="honey"> # push to stack
<struct Potato::IR::Call target=21, arg_count=1> # enter function scope, execute

Compiling
The compiler encodes our instruction list into bytecode: the opcode (a byte representing the instruction type) and an operand (an index or value), written into a binary file to be passed to the VM [4].
def self.write(f, opcode, value = nil)
f.write([opcode].pack("C"))
f.write([value].pack("L>")) if value
end
# ...
case instruction
when IR::Push
case instruction.value
when String
write(f, 0x06, instruction.value.bytesize)
f.write(instruction.value)
end
when IR::LoadVar
write(f, 0x04, instruction.index)
when IR::LoadCaptured
write(f, 0x0C, instruction.index)
when IR::StoreVar
write(f, 0x05, instruction.index)
when IR::Print
write(f, 0x03)
when IR::Call
write(f, 0x09, instruction.target)
f.write([instruction.arg_count].pack("L>"))
when IR::Return
write(f, 0x0A)
when IR::Jump
write(f, 0x0B, instruction.target)
end

For example, <struct Potato::IR::LoadVar index=0> becomes opcode 0x04 followed by a 4-byte operand for the index, 5 bytes total.
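We can verify that layout by hand with Ruby's pack/unpack, the same calls the compiler above uses:

```ruby
# Decode the 5-byte LoadVar instruction: one opcode byte, then a 4-byte
# big-endian unsigned operand.
bytes = [0x04, 0x00, 0x00, 0x00, 0x00].pack("C*")

opcode  = bytes.unpack1("C")         # => 4 (LoadVar)
operand = bytes[1, 4].unpack1("L>")  # => 0 (locals index)
bytes.bytesize                       # => 5
```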
Hexadecimal uses the numbers 0-9 and letters a-f. It's an easier way to write binary; for example, 0x04 in hex is 00000100 in binary [10].
Interpreting (VM)
A bytecode VM has a straightforward task: read each byte, and follow the instruction it represents [4].
0600 0000 0662 756d 626c 6505 0000 0000
0b00 0000 1c04 0000 0000 030a 0600 0000
0568 6f6e 6579 0900 0000 1500 0000 01

Potato's VM is stack based, which means operations work by pushing and popping values off the stack. For example, addition works by asking the VM to grab whatever two things are at the top of the stack, add them, and push the result back [4].
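The stack discipline for addition can be sketched like this (an Add opcode is assumed here; the 🐝 program itself never adds anything):

```ruby
# Evaluating 1 potato 2 on a stack machine.
stack = []
stack.push(1)       # Push 1
stack.push(2)       # Push 2

b = stack.pop       # Add: pop both operands...
a = stack.pop
stack.push(a + b)   # ...and push the result

stack  # => [3]
```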
case byte
when 0x03 # print
puts stack.pop
when 0x04 # load variable
index = read
stack.push(locals[index])
when 0x0C # load captured variable
index = read
parent_locals = scopes.last[:locals]
stack.push(parent_locals[index])
when 0x05 # store var
index = read
locals[index] = stack.pop
when 0x06 # string
length = read
value = @bytecode[@pos, length].pack("C*")
@pos += length
stack.push(value)
when 0x09 # call
target = read
arg_count = read
scopes.push({ locals: locals, call_site: @pos })
locals = Array.new(size_of_cur_frame)
args = stack.pop(arg_count)
args.each_with_index { |arg, i| locals[i] = arg }
@pos = target
when 0x0A # return
scope = scopes.pop
locals = scope[:locals]
@pos = scope[:call_site]
when 0x0B # jump
@pos = read
end

Stepping through the 🐝 example:
0x06 Push value="bumble"
# stack: ["bumble"] locals: []
0x05 StoreVar index=0
# stack: [] locals: ["bumble"]
0x0B Jump target=28
0x06 Push value="honey"
# stack: ["honey"] locals: ["bumble"]
0x09 Call target=21, arg_count=1
# new frame, the arg is moved to locals
# stack: [] locals: ["honey"]
0x04 LoadVar index=0
# stack: ["honey"] locals: ["honey"]
0x03 Print
# > "honey"
0x0A Return
# move back to parent frame
# stack: [] locals: ["bumble"]

There it is! 🐝 After quite a bit of chopping up, rearranging, and translating, honey is printed to the terminal.
TLDR;
As code authors we might want to write a ton of 🐝s, and that's okay! We can have as many 🐝s as we want because the language pipeline will remove everything it doesn't need as soon as it doesn't need it.
Source → Tokens → AST → Desugaring → Semantic Analysis → IR → Bytecode → VM → Output

- Semantic Analysis: Who are these things and where do they live?
- IR: What do we need to do, in what order?
- Bytecode: How can we say all that... but smaller?
- VM: Read tiny instructions, execute, repeat
Resources
- [1] Introduction to Compilers and Language Design, University of Notre Dame
- [2] Lexical Analysis
- [3] Parsing
- [4] Crafting Interpreters, Robert Nystrom
- [5] Language Implementation Patterns, Terence Parr
- [6] Syntactic Sugar
- [7] Programming Languages I, University of Tübingen
- [8] Semantic Analysis Symbol Tables, Worcester Polytechnic
- [9] Semantic Analysis, Stanford
- [10] Introduction to Microcomputers