Chapter 19: Virtual Machine
Terms defined: Application Binary Interface, assembler, assembly code, bitwise operation, disassembler, instruction pointer, instruction set, label (address in memory), op code, register, virtual machine, word (of memory)
Computers don’t execute JavaScript directly. Instead, each processor has its own instruction set, and a compiler translates high-level languages into those instructions. Compilers often use an intermediate representation called assembly code that gives instructions names instead of numbers. To understand more about how JavaScript actually runs, we will simulate a very simple processor with a little bit of memory. If you want to dive deeper, have a look at Bob Nystrom’s Crafting Interpreters. You may also enjoy Human Resource Machine, which asks you to solve puzzles of increasing difficulty using a processor almost as simple as ours.
Section 19.1: What is the architecture of our virtual machine?
Our virtual machine has three parts, which are shown in Figure 19.1 for a program made up of 110 instructions:
-
An instruction pointer (IP) that holds the memory address of the next instruction to execute. It is automatically initialized to point at address 0, which is where every program must start. This rule is part of the Application Binary Interface (ABI) for our virtual machine.
-
Four registers named R0 to R3 that instructions can access directly. There are no memory-to-memory operations in our VM: everything happens in or through registers.
-
256 words of memory, each of which can store a single value. Both the program and its data live in this single block of memory; we chose the size 256 so that each address will fit in a single byte.
The instructions for our VM are 3 bytes long.
The op code fits into one byte,
and each instruction may optionally include one or two single-byte operands.
Each operand is a register identifier,
a constant,
or an address
(which is just a constant that identifies a location in memory);
since constants have to fit in one byte,
the largest number we can represent directly is 256.
Table 19.1 uses the letters r
, c
, and a
to indicate instruction format,
where r
indicates a register identifier,
c
indicates a constant,
and a
indicates an address.
Instruction | Code | Format | Action | Example | Equivalent |
---|---|---|---|---|---|
hlt |
1 | -- |
Halt program | hlt |
process.exit(0) |
ldc |
2 | rc |
Load immediate | ldc R0 123 |
R0 := 123 |
ldr |
3 | rr |
Load register | ldr R0 R1 |
R0 := RAM[R1] |
cpy |
4 | rr |
Copy register | cpy R0 R1 |
R0 := R1 |
str |
5 | rr |
Store register | str R0 R1 |
RAM[R1] := R0 |
add |
6 | rr |
Add | add R0 R1 |
R0 := R0 + R1 |
sub |
7 | rr |
Subtract | sub R0 R1 |
R0 := R0 - R1 |
beq |
8 | ra |
Branch if equal | beq R0 123 |
if (R0 === 0) PC := 123 |
bne |
9 | ra |
Branch if not equal | bne R0 123 |
if (R0 !== 0) PC := 123 |
prr |
10 | r- |
Print register | prr R0 |
console.log(R0) |
prm |
11 | r- |
Print memory | prm R0 |
console.log(RAM[R0]) |
We put our VM’s architectural details in a file that can be shared by other components:
const OPS = {
hlt: { code: 1, fmt: '--' }, // Halt program
ldc: { code: 2, fmt: 'rv' }, // Load immediate
ldr: { code: 3, fmt: 'rr' }, // Load register
cpy: { code: 4, fmt: 'rr' }, // Copy register
str: { code: 5, fmt: 'rr' }, // Store register
add: { code: 6, fmt: 'rr' }, // Add
sub: { code: 7, fmt: 'rr' }, // Subtract
beq: { code: 8, fmt: 'rv' }, // Branch if equal
bne: { code: 9, fmt: 'rv' }, // Branch if not equal
prr: { code: 10, fmt: 'r-' }, // Print register
prm: { code: 11, fmt: 'r-' } // Print memory
}
const OP_MASK = 0xFF // select a single byte
const OP_SHIFT = 8 // shift up by one byte
const OP_WIDTH = 6 // op width in characters when printing
const NUM_REG = 4 // number of registers
const RAM_LEN = 256 // number of words in RAM
export {
OPS,
OP_MASK,
OP_SHIFT,
OP_WIDTH,
NUM_REG,
RAM_LEN
}
While there isn’t a name for this design pattern, putting all the constants that define a system in one file instead of scattering them across multiple files makes them easier to find as well as ensuring consistency.
Section 19.2: How can we execute these instructions?
As in previous chapters, we will split a class that would normally be written in one piece into several parts for exposition. We start by defining a class with an instruction pointer, some registers, and some memory along with a prompt for output:
import assert from 'assert'
import {
OP_MASK,
OP_SHIFT,
NUM_REG,
RAM_LEN
} from './architecture.js'
const COLUMNS = 4
const DIGITS = 8
class VirtualMachineBase {
constructor () {
this.ip = 0
this.reg = Array(NUM_REG)
this.ram = Array(RAM_LEN)
this.prompt = '>>'
}
}
export default VirtualMachineBase
A program is just an array of numbers representing instructions. To load one, we copy those numbers into memory and reset the instruction pointer and registers:
initialize (program) {
assert(program.length <= this.ram.length,
'Program is too long for memory')
for (let i = 0; i < this.ram.length; i += 1) {
if (i < program.length) {
this.ram[i] = program[i]
} else {
this.ram[i] = 0
}
}
this.ip = 0
this.reg.fill(0)
}
In order to handle the next instruction, the VM gets the value in memory that the instruction pointer currently refers to and moves the instruction pointer on by one address. It then uses bitwise operations to extract the op code and operands from the instruction (Figure 19.2):
fetch () {
assert((0 <= this.ip) && (this.ip < RAM_LEN),
`Program counter ${this.ip} out of range 0..${RAM_LEN}`)
let instruction = this.ram[this.ip]
this.ip += 1
const op = instruction & OP_MASK
instruction >>= OP_SHIFT
const arg0 = instruction & OP_MASK
instruction >>= OP_SHIFT
const arg1 = instruction & OP_MASK
return [op, arg0, arg1]
}
Semi-realistic
We always unpack two operands regardless of whether the instructions have them or not because this is what a hardware implementation would do. We have also included assertions in our VM to simulate the way that real hardware includes logic to detect illegal instructions and out-of-bound memory addresses.
The next step is to extend our base class with one that has a run
method.
As its name suggests,
this runs the program by fetching instructions and executing them until told to stop:
import assert from 'assert'
import {
OPS
} from './architecture.js'
import VirtualMachineBase from './vm-base.js'
class VirtualMachine extends VirtualMachineBase {
run () {
let running = true
while (running) {
const [op, arg0, arg1] = this.fetch()
switch (op) {
case OPS.hlt.code:
running = false
break
case OPS.ldc.code:
this.assertIsRegister(arg0, op)
this.reg[arg0] = arg1
break
default:
assert(false, `Unknown op ${op}`)
break
}
}
}
assertIsRegister (reg) {
assert((0 <= reg) && (reg < this.reg.length),
`Invalid register ${reg}`)
}
assertIsAddress (addr) {
assert((0 <= addr) && (addr < this.ram.length),
`Invalid register ${addr}`)
}
}
export default VirtualMachine
Some instructions are very similar to others, so we will only look at three here. The first stores the value of one register in the address held by another register:
case OPS.str.code:
this.assertIsRegister(arg0, op)
this.assertIsRegister(arg1, op)
this.assertIsAddress(this.reg[arg1], op)
this.ram[this.reg[arg1]] = this.reg[arg0]
break
The first three lines check that the operation is legal; the fourth one uses the value in one register as an address, which is why it has nested array indexing.
Adding the value in one register to the value in another register is simpler:
case OPS.add.code:
this.assertIsRegister(arg0, op)
this.assertIsRegister(arg1, op)
this.reg[arg0] += this.reg[arg1]
break
as is jumping to a fixed address if the value in a register is zero:
case OPS.beq.code:
this.assertIsRegister(arg0, op)
this.assertIsAddress(arg1, op)
if (this.reg[arg0] === 0) {
this.ip = arg1
}
break
Section 19.3: What do assembly programs look like?
We could figure out numerical op codes by hand, and in fact that’s what the first programmers did. However, it is much easier to use an assembler, which is just a small compiler for a language that very closely represents actual machine instructions.
Each command in our assembly languages matches an instruction in the VM. Here’s an assembly language program to print the value stored in R1 and then halt:
# Print initial contents of R1.
prr R1
hlt
Its numeric representation is:
00010a
000001
One thing the assembly language has that the instruction set doesn’t
is labels on addresses.
The label loop
doesn’t take up any space;
instead,
it tells the assembler to give the address of the next instruction a name
so that we can refer to that address as @loop
in jump instructions.
For example,
this program prints the numbers from 0 to 2
(Figure 19.3):
# Count up to 3.
# - R0: loop index.
# - R1: loop limit.
ldc R0 0
ldc R1 3
loop:
prr R0
ldc R2 1
add R0 R2
cpy R2 R1
sub R2 R0
bne R2 @loop
hlt
000002
030102
00000a
010202
020006
010204
000207
020209
000001
Let’s trace this program’s execution (Figure 19.4):
- R0 holds the current loop index.
- R1 holds the loop’s upper bound (in this case 3).
- The loop prints the value of R0 (one instruction).
- The program adds 1 to R0. This takes two instructions because we can only add register-to-register.
- It checks to see if we should loop again, which takes three instructions.
- If the program doesn’t jump back, it halts.
The implementation of the assembler mirrors the simplicity of assembly language. The main method gets interesting lines, finds the addresses of labels, and turns each remaining line into an instruction:
assemble (lines) {
lines = this.cleanLines(lines)
const labels = this.findLabels(lines)
const instructions = lines.filter(line => !this.isLabel(line))
const compiled = instructions.map(instr => this.compile(instr, labels))
const program = this.instructionsToText(compiled)
return program
}
cleanLines (lines) {
return lines
.map(line => line.trim())
.filter(line => line.length > 0)
.filter(line => !this.isComment(line))
}
isComment (line) {
return line.startsWith('#')
}
To find labels, we go through the lines one by one and either save the label or increment the current address (because labels don’t take up space):
findLabels (lines) {
const result = {}
let index = 0
lines.forEach(line => {
if (this.isLabel(line)) {
const label = line.slice(0, -1)
assert(!(label in result),
`Duplicate label ${label}`)
result[label] = index
} else {
index += 1
}
})
return result
}
isLabel (line) {
return line.endsWith(':')
}
To compile a single instruction we break the line into tokens, look up the format for the operands, and pack them into a single value:
compile (instruction, labels) {
const [op, ...args] = instruction.split(/\s+/)
assert(op in OPS,
`Unknown operation "${op}"`)
let result = 0
switch (OPS[op].fmt) {
case '--':
result = this.combine(
OPS[op].code
)
break
case 'r-':
result = this.combine(
this.register(args[0]),
OPS[op].code
)
break
case 'rr':
result = this.combine(
this.register(args[1]),
this.register(args[0]),
OPS[op].code
)
break
case 'rv':
result = this.combine(
this.value(args[1], labels),
this.register(args[0]),
OPS[op].code
)
break
default:
assert(false,
`Unknown instruction format ${OPS[op].fmt}`)
}
return result
}
Combining op codes and operands into a single value is the reverse of the unpacking done by the virtual machine:
combine (...args) {
assert(args.length > 0,
'Cannot combine no arguments')
let result = 0
for (const a of args) {
result <<= OP_SHIFT
result |= a
}
return result
}
Finally, we need few utility functions:
instructionsToText (program) {
return program.map(op => op.toString(16).padStart(OP_WIDTH, '0'))
}
register (token) {
assert(token[0] === 'R',
`Register "${token}" does not start with 'R'`)
const r = parseInt(token.slice(1))
assert((0 <= r) && (r < NUM_REG),
`Illegal register ${token}`)
return r
}
value (token, labels) {
if (token[0] !== '@') {
return parseInt(token)
}
const labelName = token.slice(1)
assert(labelName in labels,
`Unknown label "${token}"`)
return labels[labelName]
}
Let’s try assembling a program and display its output, the registers, and the interesting contents of memory. As a test, this program counts up to three:
# Count up to 3.
# - R0: loop index.
# - R1: loop limit.
ldc R0 0
ldc R1 3
loop:
prr R0
ldc R2 1
add R0 R2
cpy R2 R1
sub R2 R0
bne R2 @loop
hlt
>> 0
>> 1
>> 2
R0 = 3
R1 = 3
R2 = 0
R3 = 0
0: 00000002 00030102 0000000a 00010202
4: 00020006 00010204 00000207 00020209
8: 00000001 00000000 00000000 00000000
Section 19.4: How can we store data?
It is tedious to write interesting programs when each value needs a unique name. We can do a lot more once we have collections like arrays, so let’s add those to our assembler. We don’t have to make any changes to the virtual machine, which doesn’t care if we think of a bunch of numbers as individuals or elements of an array, but we do need a way to create arrays and refer to them.
We will allocate storage for arrays at the end of the program
by using .data
on a line of its own to mark the start of the data section
and then label: number
to give a region a name and allocate some storage space
(Figure 19.5).
This enhancement only requires a few changes to the assembler. First, we need to split the lines into instructions and data allocations:
assemble (lines) {
lines = this.cleanLines(lines)
const [toCompile, toAllocate] = this.splitAllocations(lines)
const labels = this.findLabels(lines)
const instructions = toCompile.filter(line => !this.isLabel(line))
const baseOfData = instructions.length
this.addAllocations(baseOfData, labels, toAllocate)
const compiled = instructions.map(instr => this.compile(instr, labels))
const program = this.instructionsToText(compiled)
return program
}
splitAllocations (lines) {
const split = lines.indexOf(DIVIDER)
if (split === -1) {
return [lines, []]
} else {
return [lines.slice(0, split), lines.slice(split + 1)]
}
}
Second, we need to figure out where each allocation lies and create a label accordingly:
addAllocations (baseOfData, labels, toAllocate) {
toAllocate.forEach(alloc => {
const fields = alloc.split(':').map(a => a.trim())
assert(fields.length === 2,
`Invalid allocation directive "${alloc}"`)
const [label, numWordsText] = fields
assert(!(label in labels),
`Duplicate label "${label}" in data allocation`)
const numWords = parseInt(numWordsText)
assert((baseOfData + numWords) < RAM_LEN,
`Allocation "${label}" requires too much memory`)
labels[label] = baseOfData
baseOfData += numWords
})
}
And that’s it: no other changes are needed to either compilation or execution. To test it, let’s fill an array with the numbers from 0 to 3:
# Count up to 3.
# - R0: loop index.
# - R1: loop limit.
# - R2: array index.
# - R3: temporary.
ldc R0 0
ldc R1 3
ldc R2 @array
loop:
str R0 R2
ldc R3 1
add R0 R3
add R2 R3
cpy R3 R1
sub R3 R0
bne R3 @loop
hlt
.data
array: 10
R0 = 3
R1 = 3
R2 = 14
R3 = 0
0: 00000002 00030102 000b0202 00020005
4: 00010302 00030006 00030206 00010304
8: 00000307 00030309 00000001 00000000
c: 00000001 00000002 00000000 00000000
How does it actually work?
Our VM is just another program. If you’d like to know what happens when instructions finally meet hardware, and how electrical circuits are able to do arithmetic, make decisions, and talk to the world, [Patterson2017] has everything you want to know and more.
Section 19.5: Exercises
Swapping values
Write an assembly language program that swaps the values in R1 and R2 without affecting the values in other registers.
Reversing an array
Write an assembly language program that starts with:
- the base address of an array in one word
- the length of the array N in the next word
- N values immediately thereafter
and reverses the array in place.
Increment and decrement
-
Add instructions
inc
anddec
that add one to the value of a register and subtract one from the value of a register respectively. -
Rewrite the examples to use these instructions. How much shorter do they make the programs? How much easier to read?
Using long addresses
-
Modify the virtual machine so that the
ldr
andstr
instructions contain 16-bit addresses rather than 8-bit addresses and increase the virtual machine’s memory to 64K words to match. -
How does this complicate instruction interpretation?
Operating on strings
The C programming language stored character strings as non-zero bytes terminated by a byte containing zero.
-
Write a program that starts with the base address of a string in R1 and finishes with the length of the string (not including the terminator) in the same register.
-
Write a program that starts with the base address of a string in R1 and the base address of some other block of memory in R2 and copies the string to that new location (including the terminator).
-
What happens in each case if the terminator is missing?
Call and return
-
Add another register to the virtual machine called SP (“stack pointer”) that is automatically initialized to the last address in memory.
-
Add an instruction
psh
(short for “push”) that copies a value from a register to the address stored in SP and then subtracts one from SP. -
Add an instruction
pop
(short for “pop”) that adds one to SP and then copies a value from that address into a register. -
Using these instructions, write a subroutine that evaluates
2x+1
for every value in an array.
Disassembling instructions
A disassembler turns machine instructions into assembly code.
Write a disassembler for the instruction set used by our virtual machine.
(Since the labels for addresses are not stored in machine instructions,
disassemblers typically generate labels like @L001
and @L002
.)
Linking multiple files
-
Modify the assembler to handle
.include filename
directives. -
What does your modified assembler do about duplicate label names? How does it prevent infinite includes (i.e.,
A.as
includesB.as
which includesA.as
again)?
Providing system calls
Modify the virtual machine so that developers can add “system calls” to it.
-
On startup, the virtual machine loads an array of functions defined in a file called
syscalls.js
. -
The
sys
instruction takes a one-byte constant argument. It looks up the corresponding function and calls it with the values of R0-R3 as parameters and places the result in R0.
Unit testing
-
Write unit tests for the assembler.
-
Once they are working, write unit tests for the virtual machine.