Devlog: Writing a CHIP-8 emulator

I first heard of CHIP-8 when I was beginning to learn Rust. A Redditor suggested it as a reasonably-sized but simple project to get started with a new language. I tried it out, but was unable to even get the obligatory IBM logo program to run, likely due to a lack of Rust knowledge, inexperience with programming, or both. Anyways, I bookmarked the tutorial with the intent of trying it out some other day, and moved on to other projects.
Wellll, 3 years have passed since then, and I’ve become much better at Rust and programming in general. Recently, I started learning C in Harvard’s CS50x course, and I really liked its simplicity and closeness to the bare metal. After completing the course’s C assignments, and creating a small Arduino project, I yearned for more C (as one does).
Creating the Emulator
Resources Used
I followed Tobias Langhoff’s guide, which points out common pitfalls, and is overall an excellent resource for implementing features iteratively.
Later on, I discovered Timendus’ test suite, which I wish I had known about earlier. It is invaluable for checking the implementation of the more nuanced behaviour of the CHIP-8 and its derivatives, such as carry flag behaviour, and implementation-specific quirks that have emerged over the years of various emulators being developed for different platforms.
If you are making your own emulator, I highly recommend joining the Emulator Development Discord server, and making use of the #chip-8 and #resources-systems channels. There are active and experienced people there who can help you with issues in your emulator. Use the websites linked in the server if you would like to find more resources and documentation about the CHIP-8.
I do not recommend using search engines to find resources, because they often surface incorrect or outdated websites. Use the links in the EmuDev server, and ask in the #chip-8 channel if you really cannot find out about something.
Initial Implementation
The first step was defining variables for the most crucial components of the system: RAM, the display buffer, and the program counter (`PC`), index (`I`), and variable (`V0`-`VF`) registers. I dumped these into global scope for the moment; I wanted to worry about separating the core and machine state later.
I copied the font from Tobias’ guide into a constant, and loaded the font into RAM beginning at address `0x050`. Then, I loaded the ROM file specified in the command-line argument into RAM, beginning at address `0x200`.
The execution loop of the CHIP-8 is quite simple:
1. Fetch 2 bytes from RAM at the address in the program counter, and store them in a 16-bit `instruction` variable. The architecture is big-endian, i.e. the first byte (the one with the lesser address) is the more significant byte.
2. Increment the program counter by 2, since we just loaded 2 bytes.
3. ‘Decode’ the instruction. CHIP-8 instructions pack the opcode and operands into a single 16-bit number. So, we need to mask out parts of it to determine the operands.

   | Mask | Description | Used |
   | --- | --- | --- |
   | `_x__` | 2nd nibble | For the `x`th variable register, written as `Vx` |
   | `__y_` | 3rd nibble | For the `y`th variable register, written as `Vy` |
   | `___n` | 4th nibble | As a 4-bit number (nibble) |
   | `__nn` | 3rd & 4th nibbles | As an 8-bit number (byte) |
   | `_nnn` | 2nd, 3rd, and 4th nibbles | As a 12-bit number |

   The purpose of the last 3 operands depends on the opcode.

   I extracted these operands by masking out the required nibble(s) using `AND`, and bit-shifting them to the unit’s place. I used `#define` for these instead of dedicated variables. Remember to add an extra pair of parentheses around the entire expression when using `#define`!
4. Determine the opcode to execute. The first nibble, and sometimes `n` or `nn`, is used to specify the opcode. I used nested switch statements for this, which in hindsight was a terrible idea. I kept forgetting to add `break`, and I initially used multiple `default` blocks to catch illegal instructions. A language with more powerful pattern matching would fare better here.
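Here is a minimal sketch of the fetch and decode steps, assuming RAM is a plain byte array. The names `fetch`, `OP_X`, `OP_Y`, `OP_N`, `OP_NN` and `OP_NNN` are made up for this example and are not necessarily what the repository uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Operand masks. Note the outer parentheses around each whole expression. */
#define OP_X(ins)   (((ins) & 0x0F00u) >> 8) /* _x__: index of Vx   */
#define OP_Y(ins)   (((ins) & 0x00F0u) >> 4) /* __y_: index of Vy   */
#define OP_N(ins)   ((ins) & 0x000Fu)        /* ___n: 4-bit number  */
#define OP_NN(ins)  ((ins) & 0x00FFu)        /* __nn: 8-bit number  */
#define OP_NNN(ins) ((ins) & 0x0FFFu)        /* _nnn: 12-bit number */

/* Fetch the big-endian 16-bit instruction at *pc, then advance *pc by 2. */
static uint16_t fetch(const uint8_t *ram, size_t ram_size, uint16_t *pc)
{
    assert((size_t)*pc + 1 < ram_size); /* both bytes must be inside RAM */
    uint16_t ins = (uint16_t)((ram[*pc] << 8) | ram[*pc + 1]);
    *pc += 2;
    return ins;
}
```

The opcode switch would then branch on the first nibble (`ins >> 12`) and use the macros to pull out whichever operands that opcode needs.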
The instructions you should start off with are:
- `00E0`: Clear the display (has no operands)
- `1nnn`: Jump to address `nnn` (set `PC` to `nnn`)
- `6xnn`: Set `Vx` to `nn`
- `7xnn`: Add `nn` to `Vx`
- `Annn`: Set `I` to `nnn`
- `Dxyn`: Draw
Of these, the draw instruction is the most complex to implement. In fact, I would go as far as to say it is the most complex instruction the CHIP-8 has. Despite Tobias mentioning in a yellow box to use `Vx` and `Vy` instead of `x` and `y` directly, I still made that mistake. Another mistake I made was forgetting to reset the x-coordinate for every row, i.e. each time the y-coordinate is incremented. The other 5 instructions should be trivial to implement.
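To make those two pitfalls concrete, here is a rough sketch of the draw instruction. It assumes a `bool` display buffer, and it assumes the common behaviour where the starting position wraps around the screen while the sprite itself is clipped at the edges. `draw_sprite`, `DISPLAY_W` and `DISPLAY_H` are names invented for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DISPLAY_W 64
#define DISPLAY_H 32

/* Dxyn: XOR an 8-pixel-wide, n-row sprite onto the display.
 * vx and vy are the VALUES held in Vx and Vy, not the nibbles x and y.
 * Returns the new value for VF: 1 if any lit pixel was turned off. */
static uint8_t draw_sprite(bool display[DISPLAY_H][DISPLAY_W],
                           const uint8_t *sprite, uint8_t n,
                           uint8_t vx, uint8_t vy)
{
    assert(n <= 15); /* n is a single nibble */
    uint8_t collision = 0;
    uint8_t start_x = vx % DISPLAY_W; /* the starting position wraps */
    uint8_t py = vy % DISPLAY_H;

    for (uint8_t row = 0; row < n && py < DISPLAY_H; row++, py++) {
        uint8_t px = start_x; /* reset the x-coordinate for every row */
        for (uint8_t bit = 0; bit < 8 && px < DISPLAY_W; bit++, px++) {
            if (sprite[row] & (0x80 >> bit)) {
                if (display[py][px]) {
                    collision = 1;
                }
                display[py][px] = !display[py][px];
            }
        }
    }
    return collision;
}
```

The caller would pass `&ram[I]` as the sprite and store the return value in `VF`.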
I used the terminal as a display for the moment, using pipes to demarcate the pixels and the Unicode block character to represent a pixel that is on. With this, it should be possible to run Timendus’ CHIP-8 splash screen and the venerable IBM Logo. Here is the recording of the first time my program successfully ran and displayed the IBM logo.
Completing the Instruction Set
After that, I began implementing more instructions, going in the order that Tobias’ guide is written in. It was in the middle of this that I made my initial commit to GitHub. I completed the remaining instructions in commit a460825. As expected, I made many mistakes. Here they are in no particular order.
- For the call subroutine instruction `2nnn`, I made a silly mistake and assigned `nnn` to the program counter before pushing `PC` onto the stack.
- Somehow, I completely forgot to implement `8xyE`.
- The font instruction assigned the wrong address to `I`. I didn’t consider that each character actually takes up 5 bytes, and you have to multiply `Vx` by 5 when calculating the address offset.
I also had tons of bugs with flag calculations. Timendus’ flags test was really helpful for checking this.
- You should locally buffer either `Vx` and `Vy`, or the carry/overflow flag. You don’t want the assigned value to affect the flag calculation, or vice versa.
  For example, consider the instruction `8xy4` (add with carry). If either `x` or `y` is `F`, such as with `8AF4`, it becomes tricky because the flag register itself is being used as an operand.
  If you don’t buffer the variable, you’re going to
  - use the result of the flag calculation (`0` or `1`) when doing the addition, instead of using the value that was originally in `VF`. Or,
  - miscalculate the flag since you already overwrote `Vx` with the result.
- I didn’t realise that `VF` should always be overwritten, even if it is used as an operand in the instruction.
  Taking `8FA4` as an example, you should first fetch the values of `VF` and `VA`, add and assign them to `VF`, then overwrite `VF` with `0` or `1` depending on whether the addition overflowed. This means that the value of the addition isn’t actually used when `x` is `F`.
  I think this behaviour is a side effect of the CHIP-8’s lack of a dedicated accumulator register or carry flag.
- The overflow flag for subtraction should be set even if the result is `0`, i.e. `Vx` and `Vy` are equal. In other words, the check for the overflow flag should be a non-strict inequality (`>=` rather than `>`). This was not documented in Tobias’ guide, and a website I was referring to actually specified strict inequality. I have read online that this behaviour from the original COSMAC VIP is often overlooked due to its insignificance.
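The flag rules above can be sketched like this for `8xy4` and `8xy5`. The sketch buffers the flag rather than the registers and writes `VF` last. The function names and the bare `v[16]` register array are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

/* 8xy4: Vx += Vy, then VF = 1 if the addition overflowed 8 bits. */
static void op_add(uint8_t v[16], uint8_t x, uint8_t y)
{
    assert(x < 16 && y < 16);
    /* Buffer the flag first, while both operands still hold their old values. */
    uint8_t carry = (v[x] + v[y]) > 0xFF;
    v[x] = (uint8_t)(v[x] + v[y]);
    v[0xF] = carry; /* written last, so it wins even when x is F */
}

/* 8xy5: Vx -= Vy, then VF = 1 if there was no borrow (Vx >= Vy). */
static void op_sub(uint8_t v[16], uint8_t x, uint8_t y)
{
    assert(x < 16 && y < 16);
    uint8_t no_borrow = v[x] >= v[y]; /* non-strict: equal operands set VF */
    v[x] = (uint8_t)(v[x] - v[y]);
    v[0xF] = no_borrow;
}
```

With `8FA4`, the sum is stored in `VF` and then immediately replaced by the flag, which matches the behaviour described above.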
Integrating a Windowed Front End
I decided to implement a desktop frontend first, then port the code to the Arduboy, since it would be much easier to debug on a computer. I chose SDL3 as the media interface library. There isn’t any fancy build system set up since I wanted to keep it simple; a justfile using `clang` to compile, and `pkg-config` to fetch the `sdl3` headers. The disadvantage with this setup is that it is not cross-platform, but it should work on *NIX as long as the necessary programs and libraries are installed.
I used global variables for the frontend too, and I chose to use main callbacks, a new feature in SDL3. Since rendering the pixels as-is would result in a tiny window, I set the actual window size to a default of `1280x640`, and set the logical presentation to `64x32`. This allowed me to treat the window in my code as if its resolution was `64x32`, and let SDL automatically scale and stretch/letterbox the content to the actual window’s resolution. I found SDL’s API super easy to use; excluding setup, I was able to paint the display buffer to the screen with just 7 lines of code.
Here is the commit that added the SDL3 frontend, and I added input handling and timer support in 2dca573.
In between developing other features, I made some minor improvements to the SDL frontend.
- Change the colour scheme of the display to be more pleasing (~~stolen from~~ inspired by J. Massung’s emulator).
- Use the plus and minus keys to change the emulation frequency.
- Use the space bar to pause execution.
- Simplify input handling to use an array and for loops rather than massive switch statements.
- Switch to using nanoseconds instead of milliseconds to time the emulation more precisely.
- Buffer inputs for one 60 Hz time period (16.67 ms).
- Make some tweaks to pass Timendus’ quirks test for the original CHIP-8.
- Use the SUPER-CHIP’s better-looking font.
- Redraw the display whenever any window event occurs.
Separating the Core
By now, I had been playtesting various games, and they seemed to work well. I decided it was time to refactor the core into a separate file. I started by moving the machine state variables into a struct, `MachineState`. Since dynamic allocation was not necessary, I initialised the machine state in `static` memory. I added separate functions for executing an instruction and ticking the timers.
Further Refinements
I posted about my emulator, the Arduboy port, and a draft of this article in the EmuDev server’s #chip-8 channel, and they provided some very helpful feedback about my implementation. I made more general improvements to my emulator, and some changes based on these suggestions.
- For the aforementioned flag instructions, it is recommended to buffer the overflow/carry flag rather than the registers.
- For instructions `Ex9E` and `ExA1`, only the least significant nibble of the `Vx` register should be used when checking for key input.
- The “Amiga specific behaviour” Tobias talked about in his guide is a myth, and should not be implemented.
- I implemented key release detection for `Fx0A` within the core.
- I added an illegal instruction handler to the core so that embedded systems without an easily accessible console can alert the user.
Arduboy Port
The Arduboy is a credit-card sized, GameBoy-inspired device that uses the same chip and design as the Arduino Leonardo on the inside. It is an embedded system, and if we want to port the emulator to it, we need to make the core more flexible and configurable.
Difficulties
The most limiting factor with the Arduboy is its measly 2.5 KiB of RAM, less than what the original COSMAC VIP from 1977 had! There is also the fact that it is much harder to debug; there are no sanitisers to catch illegal memory access (or even an OS to catch `segfault`s), no `gdb`/`lldb` to debug the program, no `valgrind` to assess memory safety, and so on. This was my first large project in a memory unsafe language and on an embedded system, and it was quite an adventure.
I struggled to figure out how to instantiate the machine state struct, and I had so much trouble implementing a function that redirected `printf` to the serial interface, since I didn’t understand how to properly use C’s variable arguments API.
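For anyone stuck on the same thing, this is roughly what a `printf`-style wrapper looks like using `va_list` and `vsnprintf` from the standard library. It is a sketch and not the code from my repository: `log_printf`, `write_fn` and `capture_sink` are hypothetical names, and on the Arduboy the sink would forward the text to the serial port instead of copying it into a buffer.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical output sink, e.g. a thin wrapper around the serial port. */
typedef void (*write_fn)(const char *text);

/* Format like printf into a small stack buffer, then hand it to the sink.
 * Returns what vsnprintf returns: the length the full text would have had. */
static int log_printf(write_fn sink, const char *fmt, ...)
{
    char buf[64];
    va_list args;

    assert(sink != NULL && fmt != NULL);
    va_start(args, fmt);                             /* start after the last named parameter */
    int len = vsnprintf(buf, sizeof buf, fmt, args); /* truncates safely if too long */
    va_end(args);

    if (len >= 0) {
        sink(buf);
    }
    return len;
}

/* Example sink that just remembers the last line, standing in for serial output. */
static char last_line[64];
static void capture_sink(const char *text)
{
    strncpy(last_line, text, sizeof last_line - 1);
    last_line[sizeof last_line - 1] = '\0';
}
```

The key point is that the variadic function never reads its arguments itself; it passes the `va_list` on to the `v`-prefixed variant of the formatting function.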
Peter Brown’s Arduboy simulator Ardens was quite helpful since it pauses at illegal memory access, stack overflows, etc. Even though I don’t understand AVR assembly, I could use the surrounding function calls and symbols to roughly figure out where in the program the fault occurred. Make sure you use the `.elf` file so that you have debug symbols in the simulator.
Adapting the Core
Anyways, back to the emulator. Key input handling, pixel querying and toggling, and clearing the display were made into function pointers, and I removed the display buffer from the machine state since it took up too much RAM. I made the size of the emulated RAM configurable at compile time using the `CORE_RAM_SIZE` macro, with a default of 4096 or `0x1000`. Many CHIP-8 programs can run on less than 4 KiB of RAM.
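As an illustration of that design, here is a small sketch of a core that only talks to the display through function pointers. Apart from `CORE_RAM_SIZE`, every name here (`FrontEnd`, `core_flip_pixel`, the `toy_` functions) is hypothetical; the toy front end stands in for SDL or the Arduboy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#ifndef CORE_RAM_SIZE
#define CORE_RAM_SIZE 0x1000 /* default of 4096 bytes, overridable at compile time */
#endif

/* Hypothetical interface that a front end hands to the core. */
typedef struct {
    bool (*key_pressed)(uint8_t key);
    bool (*get_pixel)(uint8_t x, uint8_t y);
    void (*toggle_pixel)(uint8_t x, uint8_t y);
    void (*clear_display)(void);
} FrontEnd;

/* The core flips a pixel through the callbacks and owns no display buffer.
 * Returns true if a lit pixel was turned off (a collision). */
static bool core_flip_pixel(const FrontEnd *fe, uint8_t x, uint8_t y)
{
    assert(fe != NULL);
    bool was_on = fe->get_pixel(x, y);
    fe->toggle_pixel(x, y);
    return was_on;
}

/* A toy front end backed by a plain array. */
static bool toy_pixels[32][64];
static bool toy_key_pressed(uint8_t key) { (void)key; return false; }
static bool toy_get_pixel(uint8_t x, uint8_t y) { return toy_pixels[y][x]; }
static void toy_toggle_pixel(uint8_t x, uint8_t y) { toy_pixels[y][x] = !toy_pixels[y][x]; }
static void toy_clear_display(void)
{
    for (size_t y = 0; y < 32; y++)
        for (size_t x = 0; x < 64; x++)
            toy_pixels[y][x] = false;
}
```

A smaller RAM could then be selected when compiling, for example with `-DCORE_RAM_SIZE=0x800`, without touching the core's source.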
Implementing the Arduboy Front End
After a few days of frustrating debugging, I got an initial working implementation in 37e3771. Key input was implemented using a per-program configurable keymap, which maps to the Arduboy’s 4-button D-pad and A & B buttons. Support for sound was next, which is a simple 440 Hz square wave tone.
I added a program selection menu at startup, and included 4 programs with the emulator. I also made it possible to upload a program to RAM from a computer over serial. You need to select “Load from Computer” in the Arduboy’s startup menu, run the Python script with the path to the ROM file, type in the keys to map the Arduboy’s buttons to, and wait for the program to transfer. This feature was particularly frustrating to implement, because it’s hard to keep two computers in sync and waiting for each other. This was exacerbated by the design of the Arduino `Serial` class, which does not have a blocking `read()` method.
Optimising the Memory Layout
In the CHIP-8’s memory map, addresses `0x000`-`0x1FF` are unused (apart from the font, which is 80 bytes) since that is where the COSMAC VIP stored its interpreter. This amounts to 432 bytes, which is around 17% of the Arduboy’s total RAM. We can’t let these precious bytes go to waste, so I devised a solution to slightly modify the memory map of the emulator.
The font can be placed anywhere in RAM, so I placed it just before where the program starts. Since the program starts at `0x200` and the font takes up `0x50` bytes, I placed it at `0x1B0`. Now, whenever the program uses addresses past the configured `CORE_RAM_SIZE`, we can make it wrap around to the beginning of the emulated RAM like a ring buffer. This effectively increases our usable program memory from `CORE_RAM_SIZE - 0x200` to `CORE_RAM_SIZE - 0x50`; that’s 432 more bytes of usable program memory!
This required some modification of the code. The core itself was easy to modify, but for program loading I had to replace `Serial.readBytesUntil` with a custom for loop, and I had to split the `memcpy_P` into 2 parts if the program wraps around the RAM buffer.
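The wrap-around layout can be sketched as below. This is a desktop-flavoured illustration using plain `memcpy` rather than the AVR-specific `memcpy_P`, with an example `CORE_RAM_SIZE` of 2 KiB. `PROGRAM_START`, `FONT_SIZE`, `FONT_START`, `read_ram` and `load_program` are names chosen for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CORE_RAM_SIZE 0x800 /* example value: 2 KiB of emulated RAM */
#define PROGRAM_START 0x200
#define FONT_SIZE     0x50
#define FONT_START    (PROGRAM_START - FONT_SIZE) /* 0x1B0 */

/* Every memory access goes through a wrap, like indexing a ring buffer. */
static uint8_t read_ram(const uint8_t *ram, uint16_t addr)
{
    return ram[addr % CORE_RAM_SIZE];
}

/* Copy a program into RAM starting at PROGRAM_START. If it runs past the end
 * of the buffer, the remainder continues at address 0 (a second copy).
 * Returns 0 on success, or -1 if the program would overwrite the font. */
static int load_program(uint8_t *ram, const uint8_t *rom, uint16_t len)
{
    assert(ram != NULL && rom != NULL);
    if (len > CORE_RAM_SIZE - FONT_SIZE) {
        return -1;
    }
    uint16_t first = CORE_RAM_SIZE - PROGRAM_START; /* room before the end of RAM */
    if (first > len) {
        first = len;
    }
    memcpy(ram + PROGRAM_START, rom, first);
    memcpy(ram, rom + first, (size_t)(len - first)); /* wrapped part, may be 0 bytes */
    return 0;
}
```

With these values the largest loadable program is `0x800 - 0x50 = 0x7B0` bytes, and its last byte lands at `0x1AF`, right below the font.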
With this, the emulator is pretty much complete! It passes all the tests as expected, and plays games quite well.
SUPER-CHIP Emulator in Rust
In 1991, Erik Bryntse introduced SUPER-CHIP (v1.0 and v1.1), a CHIP-8 emulator for the HP-48 series of graphing calculators with additional features. It included a new 128x64 high resolution mode, support for 16x16 sprites in the draw instruction, a larger font, persistent storage and retrieval of up to 8 of the variable registers, display scrolling, and a proper exit instruction. I decided to write an emulator in Rust that supported both the CHIP-8 and SUPER-CHIP.
Rust Port
I started off by writing the core of the emulator based on the C implementation. I was able to improve instruction decoding using pattern matching, but Rust requires explicit casts (`val as T`) to convert into smaller types, which made the code messier. For the frontend I used SDL3 again, but without main callbacks since the Rust library didn’t have all the new features yet.
This was when I discovered a bug with my timing. I was playing Tetris, and I noticed the pieces would glitch out when I rotated them.
Timing Bug
I implemented the execution loop by creating two separate timers, one for the 60 Hz delay and sound timers, and another for the instruction timer. This approach should not be used since it is possible for a different number of instructions to run between two timer ticks. For example, at 600 instructions per second, there is supposed to be a constant 10 instructions per timer tick (or ‘frame’). With the method I was using, it is possible for the timers to drift, and 9 or 11 instructions may execute between timer ticks. If the instruction frequency is not a multiple of 60, it can also lead to ‘leap’ instructions. This leads to bugs with certain timing sensitive games like Tetris.
The recommended approach, which I learned from the EmuDev server, is to run at a fixed 60 Hz. In each frame, you should decrement the timers, and execute a constant number of instructions at once (instructions per frame or IPF). This also fixes the 100% single-core usage issue, since the CPU can sleep for the rest of the frame and busy-wait for significantly less time.
I implemented the new approach in the Rust code, and it fixed the glitches. I ported the new timing to the Arduboy emulator too, since it was suffering the same issue, although much less frequently. With this bug fix, the Rust port’s CHIP-8 implementation was complete.
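The fixed 60 Hz approach boils down to something like the following sketch, shown in C. The `Machine` struct and `execute_instruction` are placeholders invented for the example; the real step function would fetch, decode and execute an instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t delay_timer;
    uint8_t sound_timer;
    unsigned long executed; /* instruction count, standing in for real CPU state */
} Machine;

/* Placeholder for the real fetch/decode/execute step. */
static void execute_instruction(Machine *m)
{
    m->executed++;
}

/* One 60 Hz frame: tick both timers once, then run a fixed batch of
 * instructions. ipf = instructions per second / 60, e.g. 600 / 60 = 10. */
static void run_frame(Machine *m, unsigned ipf)
{
    assert(m != NULL);
    if (m->delay_timer > 0) {
        m->delay_timer--;
    }
    if (m->sound_timer > 0) {
        m->sound_timer--;
    }
    for (unsigned k = 0; k < ipf; k++) {
        execute_instruction(m);
    }
}
```

The front end calls `run_frame` once every 1/60th of a second and sleeps for whatever time is left in the frame, so the ratio of instructions to timer ticks can never drift.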
SUPER-CHIP Features
The first order of business was to increase the display buffer to 128x64. In low resolution mode, which is backwards compatible with the CHIP-8, the emulator must scale the ‘legacy’ display coordinates by 2 and toggle pixels in 2x2 blocks. I took the opportunity to improve the drawing code as well. I implemented this in commit 7747584.
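The 2x2 scaling can be sketched as follows. This is written in C to match the other examples rather than the Rust of this port, and `toggle_lores_pixel` is a hypothetical helper name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define HIRES_W 128
#define HIRES_H 64

/* Toggle one low-resolution (64x32) pixel as a 2x2 block of the 128x64 buffer.
 * Returns true if the pixel was lit before the toggle (a collision). */
static bool toggle_lores_pixel(bool display[HIRES_H][HIRES_W], uint8_t x, uint8_t y)
{
    assert(x < HIRES_W / 2 && y < HIRES_H / 2);
    bool was_on = display[y * 2][x * 2]; /* all 4 pixels of a block share one state */
    for (uint8_t dy = 0; dy < 2; dy++) {
        for (uint8_t dx = 0; dx < 2; dx++) {
            display[y * 2 + dy][x * 2 + dx] = !was_on;
        }
    }
    return was_on;
}
```

This relies on low-resolution programs only ever drawing whole blocks, so checking the top-left pixel is enough to know the state of the block.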
Despite claiming backwards compatibility, certain instructions worked differently in the SUPER-CHIP emulator compared to the original COSMAC VIP CHIP-8 emulator. These are called ‘quirks’, and the SUPER-CHIP’s are:
- `8xy1`, `8xy2`, and `8xy3` do not reset `VF`
- `8xy6` and `8xyE` use `Vx` instead of `Vy`
- `Bnnn` jumps to `nnn + Vx` instead of `nnn + V0`
- `Fx55` and `Fx65` do not increment `I`
I implemented these, and chose which mode to use based on whether the ROM file had the extension `ch8` (CHIP-8) or `sc8` (SCHIP). I ran Timendus’ quirks test for both the original CHIP-8 and the modern SUPER-CHIP. The only test that didn’t pass was the original CHIP-8’s display wait behaviour.
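One way to model the quirks is a small struct of booleans picked from the file extension. The sketch below is in C for consistency with the other examples, and the struct, field and function names are all invented for it. Treating every non-`.sc8` file as CHIP-8 is an arbitrary choice for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

typedef struct {
    bool logic_resets_vf;    /* 8xy1, 8xy2, 8xy3 reset VF        */
    bool shift_uses_vy;      /* 8xy6, 8xyE shift Vy into Vx      */
    bool jump_uses_v0;       /* Bnnn jumps to nnn + V0           */
    bool load_store_moves_i; /* Fx55, Fx65 increment I           */
} Quirks;

static const Quirks CHIP8_QUIRKS     = { true,  true,  true,  true  };
static const Quirks SUPERCHIP_QUIRKS = { false, false, false, false };

/* Pick a quirk set based on the ROM file's extension. */
static Quirks quirks_for_file(const char *path)
{
    assert(path != NULL);
    const char *dot = strrchr(path, '.'); /* last dot, or NULL if there is none */
    if (dot != NULL && strcmp(dot, ".sc8") == 0) {
        return SUPERCHIP_QUIRKS;
    }
    return CHIP8_QUIRKS;
}
```

Each affected instruction then checks the relevant field, for example shifting `Vy` into `Vx` only when `shift_uses_vy` is set.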
I implemented the new SUPER-CHIP instructions as well. However, for the moment I left out the persistent flag storage instructions, `Fx75` and `Fx85`. I felt that Rust’s slice methods and slice indexing made the scrolling instructions particularly convenient to implement. Again, the most complex instruction was the draw instruction, which essentially had 3 modes now: low-resolution, high-resolution, and 16-pixel sprite.
SUPER-CHIP Port to C++
After experiencing the limitations of C, I was interested in learning C++. I decided to port only the SUPER-CHIP part of the Rust emulator to C++, and use SDL3 again. I chose to use a build system this time, which was CMake.
My favourite improvements over C are namespaces and classes. I created the `core` namespace, with the machine state and constants like the display dimensions. The ticking methods themselves are ‘namespaced’ within the class too. I used classes to abstract away the SDL code, and the class destructor will implicitly clean up SDL at the end of the scope.
References meant that there were no longer any pointers to deal with, which I was glad about since I found pointers confusing at times, coming from Rust and its references. `std::array`s didn’t degrade into pointers like C-style arrays, so I could avoid pointer arithmetic. STL’s functional features like iterators were a breath of fresh air compared to C, albeit confusing and unintuitive at times; and safe abstractions over functions like `memcpy` and `memmove` are welcome additions. Like Rust’s slice methods, STL’s move and fill methods made display scrolling easy to implement.
One major frustration I had was with CMake, which was horrible to learn and use. The official documentation is unhelpful, tutorials that use modern features are sparse, and there are about a million different ways to achieve something depending on the version of CMake you’re using.
C++ was confusing at times too. One behaviour that tripped me up was that it copies values by default, in contrast to Rust’s move by default. This really cemented in me that I should be using references wherever possible, especially in range-based for loops. Coming from Rust, I dislike the implementation of `auto`, although I understand that with the looser typing and vast polymorphism, it would be very difficult to make it work properly.
I thought that IDE support for C++ in VS Code was poor. C++ is a complex language with many features, and I felt that VS Code and its extensions did not provide as much help as rust-analyzer for Rust would, for example. I was able to use JetBrains CLion after my GitHub Student Developer Pack was approved, and I find that it is much more helpful, with a significantly better experience out of the box. Unfortunately it is a paid product, unless you are a student.
Potential Features and Further Optimisations
- I have yet to implement sound for any of the desktop versions.
- The Arduboy’s ATmega32u4 has 32 KiB of flash memory, of which 15 KiB is still free. We can pack even more games into this storage space, but they need to be less than 1 KiB and playable with a D-pad and 2 buttons. Comment below, or contact me on Discord at therookiecoder if you have game suggestions.
- The Arduboy2 library has an internal screen buffer for the Arduboy’s `128x64` display that takes up 1 KiB when packed into bytes. Since the CHIP-8 only has a `64x32` display, it should be possible to replace it with a display buffer that only takes up 256 bytes.
- Implement the CHIP-8’s display wait behaviour.
- Port the SUPER-CHIP emulator to the Arduboy.
- Add support for the XO-CHIP, although it would be near impossible to port this to the Arduboy without significant compromises.
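For the 256-byte display buffer idea in the list above, the arithmetic is 64 × 32 pixels ÷ 8 pixels per byte = 256 bytes. Here is an untested-on-hardware sketch of what such a buffer could look like. The row-major bit layout and the names `PackedDisplay`, `packed_get` and `packed_toggle` are assumptions for the example, and this layout differs from the one the Arduboy2 library uses internally.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define LORES_W 64
#define LORES_H 32

/* One bit per pixel, 8 horizontally adjacent pixels per byte: 256 bytes. */
typedef struct {
    uint8_t bits[LORES_W * LORES_H / 8];
} PackedDisplay;

static bool packed_get(const PackedDisplay *d, uint8_t x, uint8_t y)
{
    assert(x < LORES_W && y < LORES_H);
    return (d->bits[(y * LORES_W + x) / 8] >> (x % 8)) & 1u;
}

/* Flip a pixel and return whether it was lit beforehand. */
static bool packed_toggle(PackedDisplay *d, uint8_t x, uint8_t y)
{
    assert(x < LORES_W && y < LORES_H);
    uint8_t mask = (uint8_t)(1u << (x % 8));
    uint8_t *byte = &d->bits[(y * LORES_W + x) / 8];
    bool was_on = (*byte & mask) != 0;
    *byte ^= mask;
    return was_on;
}
```

The front end would then expand each bit to a 2x2 block when copying to the Arduboy's `128x64` screen.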